Why does re.sub in Python not work correctly on this test case?

开发者 https://www.devze.com 2023-01-27 11:35 Source: the web

Try this code.

test = ' az z bz z z stuff z  z '
re.sub(r'(\W)(z)(\W)', r'\1_\2\3', test)

This should replace every standalone z with _z.

However, the result is:

' az _z bz _z z stuff _z  _z '

Notice that one z (the second of the two adjacent ones) is not replaced. My theory is that the grouping can't reuse the single space between the two z's: it would have to serve as the trailing whitespace of the first match and the leading whitespace of the second. Is there a way to fix this?


If your goal is to make sure you only match z when it's a standalone word, use \b to match word boundaries without actually consuming the whitespace:

>>> re.sub(r'\b(z)\b', r'_\1', test)
' az _z bz _z _z stuff _z  _z '


You want to avoid consuming the whitespace. Try using the zero-width word boundary \b, like this:

re.sub(r'\bz\b', '_z', test)


It does that because the two matches would have to overlap, and re.sub only replaces non-overlapping matches. The fix is to not consume the extra character, and there are two ways to do that. One is \b, the word boundary, as suggested by others. The other is a lookbehind assertion combined with a lookahead assertion. (Prefer \b where it works, as it should here. The lookaround version is mainly here for educational purposes.)

>>> re.sub(r'(?<!\w)(z)(?!\w)', r'_\1', test)
' az _z bz _z _z stuff _z  _z '

(?<!\w) makes sure there wasn't \w before.

(?!\w) makes sure there isn't \w after.

Lookaround assertions written with the (?...) syntax are not capturing groups, so the (z) is still \1.
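
A quick way to check both points, as a minimal sketch using the test string from the question: the compiled pattern's groups attribute counts only capturing groups, and for this particular string the lookaround pattern and the \b pattern produce the same result. The variable names lookaround and boundary are just illustrative.

```python
import re

test = ' az z bz z z stuff z  z '

lookaround = re.compile(r'(?<!\w)(z)(?!\w)')
boundary = re.compile(r'\b(z)\b')

# Lookarounds are assertions, not capturing groups, so only (z) is counted.
print(lookaround.groups)  # 1

# For this input, both patterns rewrite the same standalone z's.
print(lookaround.sub(r'_\1', test) == boundary.sub(r'_\1', test))  # True
```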


As for a graphical explanation of why it fails:

The regex is going through the string doing replacement; it's at these three characters:

' az _z bz z z stuff z  z '
          ^^^

It does that replacement. The final character has been acted upon, so its next step is approximately this:

' az _z bz _z z stuff z  z '
              ^^^ <- It starts matching here.
             ^ <- Not this character, it's been consumed by the last match
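
The same behaviour can be seen by listing the match spans with re.finditer. This is only a small sketch on the question's test string; the names consuming and zero_width are illustrative. With the original pattern, each match starts at or after the end of the previous one, so the space at index 10 cannot also serve as the leading \W for the z at index 11. With \b nothing around the z is consumed, so all five standalone z's are found.

```python
import re

test = ' az z bz z z stuff z  z '

# Original pattern: each match consumes the characters on both sides of z.
consuming = [m.span() for m in re.finditer(r'(\W)(z)(\W)', test)]
print(consuming)   # [(3, 6), (8, 11), (18, 21), (21, 24)]

# Word-boundary pattern: only the z itself is consumed.
zero_width = [m.span() for m in re.finditer(r'\bz\b', test)]
print(zero_width)  # [(4, 5), (9, 10), (11, 12), (19, 20), (22, 23)]
```

The z at index 11 is missing from the first list because the search resumes at index 11, where the next character is the z itself rather than a \W.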


Use this:

test = ' az z bz z z stuff z  z '
re.sub(r'\b(z)\b', r'_\1', test)