
Python regex with unicode characters bug?

Developer https://www.devze.com 2023-01-14 00:38 Source: Internet

Long story short:

>>> re.compile(r"\w*").match(u"Français")
<_sre.SRE_Match object at 0x1004246b0>
>>> re.compile(r"^\w*$").match(u"Français")
>>> re.compile(r"^\w*$").match(u"Franais")
<_sre.SRE_Match object at 0x100424780>
>>> 

Why doesn't the regex match the string containing Unicode characters once ^ and $ are added? As far as I understand, ^ stands for the beginning of the string (or line) and $ for the end of it.


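You need to specify the UNICODE flag (re.UNICODE, or re.U for short). Without it, Python 2 treats \w as equivalent to [a-zA-Z0-9_], which does not include the character 'ç'. That is also why the unanchored pattern still returns a match object: \w* matches the prefix "Fran" and stops at 'ç', while the anchored pattern has to consume the whole string and therefore fails.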

>>> re.compile(r"^\w*$", re.U).match(u"Fran\xe7ais")
<_sre.SRE_Match object at 0x101474168>
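
For readers on Python 3, here is a minimal sketch of the same behaviour. It assumes Python 3, where str patterns are Unicode-aware by default, so the re.ASCII flag is used to reproduce what the question saw in Python 2. The variable names are only illustrative.

```python
import re

text = "Fran\xe7ais"  # "Français"

# re.ASCII limits \w to [a-zA-Z0-9_], like Python 2 without re.UNICODE.
ascii_anchored = re.compile(r"^\w*$", re.ASCII)
ascii_unanchored = re.compile(r"\w*", re.ASCII)

# In Python 3, a str pattern without re.ASCII already treats \w as Unicode-aware.
unicode_anchored = re.compile(r"^\w*$")

# Unanchored: match() succeeds, but only on the ASCII prefix before 'ç'.
prefix = ascii_unanchored.match(text)
print(prefix.group())

# Anchored and ASCII-only: no match, because 'ç' is not an ASCII word character.
print(ascii_anchored.match(text))

# Anchored and Unicode-aware: the whole string matches.
print(unicode_anchored.match(text) is not None)
```

Checking prefix.group() rather than just the truthiness of the match object makes it clear how much of the string was actually consumed.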