How to match--but not capture--in Python regular expressions?

Developer https://www.devze.com 2023-03-24 05:04 Source: Internet
I've got a function spitting out "Washington D.C., DC, USA" as output. I need to capture "Washington, DC" for reasons that have to do with how I handle every single other city in the country. (Note: this is not the same as "D.C.", I need the comma to be between "Washington" and "DC", whitespace is fine)

I can't for the life of me figure out how to capture this.

Here's what I've tried:

    >>> import re
    >>> location = "Washington D.C., DC, USA"
    >>> match = re.search(r'\w+\s(?:D\.C\.), \w\w(?=\W)', location).group()
    >>> match
    u'Washington D.C., DC'

Isn't (?:...) supposed to just match (and not capture) "D.C."?

Here are the 2.7.2 Docs:

(?:...) A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

What gives??

Thanks in advance!


That's a clever approach, but non-capturing doesn't mean the text is removed from the match. It just means that the group is not included among the output groups.

You should try to do something similar to the following:

    match = re.search(r'(\w+)\s(?:D\.C\.), (\w\w)\W', location).groups()

This gives ('Washington', 'DC').

Note the difference between .group() and .groups(). The former gives you the whole string that was matched, the latter only the captured groups. Remember, you need to specify what you want to include in the output, not what you want to exclude.
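To make the distinction concrete, here is a minimal sketch using the same pattern and input as above. The variable names `m`, `whole`, `parts`, and `result` are illustrative and not from the question, and the syntax assumes Python 3 (the question used 2.7).

```python
import re

location = "Washington D.C., DC, USA"
m = re.search(r'(\w+)\s(?:D\.C\.), (\w\w)\W', location)

whole = m.group()          # the entire match, including the non-captured "D.C."
parts = m.groups()         # only the two captured groups
result = ", ".join(parts)  # rebuild the string from the pieces we asked for

print(whole)   # Washington D.C., DC,
print(parts)   # ('Washington', 'DC')
print(result)  # Washington, DC
```

Note that `whole` also ends with the comma consumed by the trailing `\W`, which is another reason to build the result from the captured groups rather than from the full match.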


    matches = re.search(r'(\w+\s)(?:D\.C\.)(, \w\w)(?=\W)', location).group(1,2)
    match = ''.join(matches)  # 'Washington , DC' (the space before the comma is kept)

When it says it is "non-capturing", that means it won't make a separate captured group for it. The text "D.C." is still in the match. See http://docs.python.org/library/re.html#match-objects


I'm late to this and the first two answers were great, but if by any chance you need a general regex for pulling cities out of a combination of cities, suffixes, states/provinces, and countries, but you know D.C. is an annoying special case, you might be able to use the following:

    >>> import re
    >>> city = re.compile(r'(\w+(?:\sD\.C\.)?), \w\w(?=\W)')
    >>> location = "Washington D.C., DC, USA"
    >>> re.search(city, location).group(1)
    'Washington D.C.'
    >>> location = "Vancouver, BC, Canada"
    >>> re.search(city, location).group(1)
    'Vancouver'

The D.C. part is made optional (as you don't always need it) in addition to being non-capturing, so it doesn't create an extra numbered group of its own.
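As a rough illustration of what the non-capturing form changes, the sketch below compares the pattern above with a hypothetical variant that uses a plain capturing group for the optional suffix. The names `capturing` and `non_capturing` are mine, and the syntax assumes Python 3.

```python
import re

# Hypothetical variant: the optional suffix uses a regular (capturing) group.
capturing = re.compile(r'(\w+(\sD\.C\.)?), \w\w(?=\W)')
# The pattern from the answer: the optional suffix is non-capturing.
non_capturing = re.compile(r'(\w+(?:\sD\.C\.)?), \w\w(?=\W)')

location = "Washington D.C., DC, USA"

print(capturing.groups)       # 2
print(non_capturing.groups)   # 1
print(capturing.search(location).groups())      # ('Washington D.C.', ' D.C.')
print(non_capturing.search(location).groups())  # ('Washington D.C.',)
```

In both cases the suffix is still part of what group 1 matched; the only difference is whether it also gets a group number of its own.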
