Why does my regex with r'string' match but not 'string' in Python?

The way regex works in Python is so intensely puzzling that it makes me more furious with each passing second. Here's my problem:

I understand that this gives a result:

re.search(r'\bmi\b', 'grand rapids, mi 49505')

while this doesn't:

re.search('\bmi\b', 'grand rapids, mi 49505')

And that's okay. I get that much of it. Now, I have a regular expression that's being generated like this:

regex = '|'.join(['\b' + str(state) + '\b' for state in states])

If I now do re.search(regex, 'grand rapids, mi 49505'), it fails for the same reason my second search() example fails.

My question: Is there any way to do what I'm trying to do?

The answer itself

regex = '|'.join([r'\b' + str(state) + r'\b' for state in states])

The reason is that the 'r' prefix tells Python not to process backslash escapes in the string literal. Without the 'r', Python turns a '\' followed by certain characters into a special character, which lets you easily enter line breaks (\n), tabs (\t) and so on.

When you write '\b', Python parses the literal and turns '\b' into a single backspace character. When you write r'\b', Python just stores '\' followed by 'b', which is what you want for a regex. Always use 'r' for strings used as regex patterns.
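
As a quick check of the difference described above, here is a minimal sketch. The variable names are only for illustration, and the sample text is the one from the question.

```python
import re

text = 'grand rapids, mi 49505'

plain = '\b'   # one character: backspace (0x08)
raw = r'\b'    # two characters: a backslash, then the letter b

print(len(plain), len(raw))   # 1 2
print(raw == '\\b')           # True: escaping the backslash by hand gives the same string

found = re.search(r'\bmi\b', text)   # pattern uses word boundaries
missed = re.search('\bmi\b', text)   # pattern contains literal backspaces

print(found.group())   # mi
print(missed)          # None
```

Writing '\\b' works just as well as r'\b'; the raw form is simply easier to read once a pattern has several backslashes.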

The 'r' notation is called a 'raw string', but that's a bit misleading: there is no separate raw string type in Python. The prefix only changes how the literal is read, and the result is an ordinary string. Just think of it as a way to tell Python to avoid being too smart.

There is another notation in Python < 3.0, u'string', that tells Python to store the string as unicode. You can combine both: ur"é\n" will store "é", then a backslash, then "n" as unicode, while u"é\n" will store "é" then a line break.

Some ways to improve your code:

regex = '|'.join(r'\b' + str(state) + r'\b' for state in states)

Removed the extra []. This passes a generator expression to join() instead of building a list first. We can do it here because the list is used directly in join() and nowhere else.

regex = '|'.join(r'\b%s\b' % state for state in states)

This takes care of the string conversion automatically and is shorter and cleaner. When you format strings in Python, think about the % operator.

If states contains a list of state zip codes, they should be stored as strings, not ints. In that case, you can skip the type casting and shorten it even more:

regex = r'\b%s\b' % r'\b|\b'.join(states)
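
To see that these variants build the same pattern, here is a small sketch. The contents of states are an assumption made up for the example; the question never shows the real list.

```python
import re

states = ['mi', 'oh', 'in']  # hypothetical sample data

a = '|'.join([r'\b' + str(s) + r'\b' for s in states])  # list comprehension
b = '|'.join(r'\b%s\b' % s for s in states)             # generator + % formatting
c = r'\b%s\b' % r'\b|\b'.join(states)                   # single join

print(a)            # \bmi\b|\boh\b|\bin\b
print(a == b == c)  # True

m = re.search(c, 'grand rapids, mi 49505')
print(m.group())    # mi
```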

Finally, you may not need a regex at all. If all you care about is checking whether one of the zip codes appears in the given string, you can just use in (for strings, it checks whether one string occurs inside another, with no word-boundary check):

matches = [s for s in states if s in 'grand rapids, mi 49505']
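
One caveat with the in approach: it matches anywhere inside a word. The sketch below compares it with the \b version on a made-up address; both states and text are assumptions chosen to show the difference, and re.escape is added only as a precaution in case an entry contains regex metacharacters.

```python
import re

states = ['mi', 'fl']      # hypothetical sample data
text = 'miami, fl 33101'   # hypothetical input

# Substring test: 'mi' is found inside 'miami'
substring_hits = [s for s in states if s in text]

# Word-boundary test: only standalone tokens count
boundary_hits = [s for s in states
                 if re.search(r'\b%s\b' % re.escape(s), text)]

print(substring_hits)  # ['mi', 'fl']
print(boundary_hits)   # ['fl']
```

So in is fine when false positives inside longer words cannot happen, for example with full zip codes, but probably not for two-letter state codes.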

Last word

I understand you may be frustrated when learning a new language, but take the time to give your question a proper title. On this website, the title should end with a question mark and give specific details about the problem.

The solution is the one you used yourself in the example above: raw strings.

regex = '|'.join(r'\b' + str(state) + r'\b' for state in states)

(Note that I also removed the extra brackets, turning the list comprehension into a generator expression.)

The key is understanding the difference between '\b' and r'\b'. Typing these in IDLE results in this output:

>>> '\b'
'\x08'
>>> r'\b'
'\\b'

So whenever you type in a backslash in a regex, you should escape it by using raw string notation.

Let's break these two strings down:

r'\bmi\b'

Python interprets the above string as six characters long (backslash, letter b, letter m, letter i, backslash, letter b). A raw string suppresses Python's translation of \b into a backspace.

re interprets the two characters \ and b as a word break.

'\bmi\b'

Python interprets the above string as four characters long (backspace, letter m, letter i, backspace).
re now sees nothing special to interpret and looks for those four literal characters.
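
To make that concrete, the sketch below lists the characters of the non-raw pattern and shows that it only matches text that really contains backspace characters. The string with_backspaces is a contrived input made up for the demonstration.

```python
import re

pattern = '\bmi\b'  # non-raw: backspace, 'm', 'i', backspace

print(len(pattern))                    # 4
print([hex(ord(c)) for c in pattern])  # ['0x8', '0x6d', '0x69', '0x8']

# Contrived text with actual backspace characters around 'mi'
with_backspaces = 'grand rapids, \x08mi\x08 49505'

hit = re.search(pattern, with_backspaces)
miss = re.search(pattern, 'grand rapids, mi 49505')

print(hit is not None)  # True
print(miss)             # None
```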

So the construction below is looking for backspaces, not word breaks:

regex = '|'.join(['\b' + str(state) + '\b' for state in states])

Try this (dropping str, state should already be a string):

regex = '|'.join([r'\b' + state + r'\b' for state in states])

The word break doesn't need to be processed in every OR expression. Pulling it out simplifies the join:

regex = r'\b(' + '|'.join(states) + r')\b'
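
The parentheses matter here. Without them, each \b would attach only to the first or last alternative. A small sketch, with states and the test string made up for the example:

```python
import re

states = ['mi', 'oh', 'in']  # hypothetical sample data

grouped = r'\b(' + '|'.join(states) + r')\b'  # \b applies to every alternative
ungrouped = r'\b' + '|'.join(states) + r'\b'  # means: \bmi  OR  oh  OR  in\b

text = 'john smith'  # contains 'oh' and 'mi' only inside longer words

print(re.search(grouped, text))            # None
print(re.search(ungrouped, text).group())  # oh

m = re.search(grouped, 'grand rapids, mi 49505')
print(m.group(1))                          # mi
```

If the group is not needed afterwards, (?:...) does the same job without capturing.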

Since Pythonistas usually frown on regexes, might as well make a readable one:

import re

# Flags are passed as an argument: inline (?ix) after leading whitespace
# is rejected by newer Python 3 versions.
pattern = re.compile(r'''
    \b    # word break
    (     # begin group 1
    AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|
    HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|
    MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|
    NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|
    SD|TN|TX|UT|VT|VA|WA|WV|WI|WY
    )     # end group 1
    \b    # word break
    ''', re.IGNORECASE | re.VERBOSE)

m = pattern.search('Grand Rapids, MI 49505')
if m:
    print(m.group(1))