Take first successful match from a batch of regexes

开发者 https://www.devze.com 2023-01-14 16:24 Source: web
I'm trying to extract set of data from a string that can match one of three patterns. I have a list of compiled regexes. I want to run through them (in order) and go with the first match.

regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]

m = None
for reg in regexes:
    m = reg.match(name)
    if m: break

if not m:
    print('ARGL NOTHING MATCHES THIS!!!')

This should work (haven't tested yet) but it's pretty fugly. Is there a better way of boiling down a loop that breaks when it succeeds or explodes when it doesn't?

There might be something specific to re that I don't know about that allows you to test multiple patterns too.


You can use the else clause of the for loop, which runs only when the loop finishes without hitting break:

for reg in regexes:
    m = reg.match(name)
    if m: break
else:
    print('ARGL NOTHING MATCHES THIS!!!')
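
Here is a self-contained Python 3 sketch of the same for/else pattern. The three date patterns and the function name first_match are made up for illustration; they are not from the question. Raising an exception instead of printing is also just one possible choice.

```python
import re

# Hypothetical patterns standing in for the question's three compiled regexes.
regexes = [
    re.compile(r"(\d{4})-(\d{2})-(\d{2})"),   # e.g. 2023-01-14
    re.compile(r"(\d{2})/(\d{2})/(\d{4})"),   # e.g. 14/01/2023
    re.compile(r"(\d{8})"),                   # e.g. 20230114
]

def first_match(name):
    """Return the first successful match, or raise if nothing matches."""
    for reg in regexes:
        m = reg.match(name)
        if m:
            break
    else:
        # Runs only when the loop finished without hitting break.
        raise ValueError("nothing matches %r" % name)
    return m

print(first_match("14/01/2023").groups())   # ('14', '01', '2023')
```

The match object keeps a reference to the pattern that produced it in m.re, so you can still tell which regex won after the loop.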


If you only need to know whether any of the regexes match, you can use the built-in any function:

if any(reg.match(name) for reg in regexes):
     ....

However, this will not tell you which regex matched.

Alternatively you can combine multiple patterns into a single regex with |:

regex = re.compile(r"(regex1)|(regex2)|...")

This does not tell you directly which regex matched, but you do get a match object that you can inspect for further information. For example, you can find out which of the alternatives succeeded from the group that is not None:

>>> match = re.match("(a)|(b)|(c)|(d)", "c")
>>> match.groups()
(None, None, 'c', None)

However, this can get complicated if any of the sub-regexes contain groups of their own, since the group numbering will shift.

This is probably faster than matching each regex individually since the regex engine has more scope for optimising the regex.
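
To make the numbering problem concrete, here is a small Python 3 sketch. The sub-patterns are invented for the example. It shows how inner groups push later alternatives to higher group numbers, and how non-capturing (?:...) groups avoid that.

```python
import re

# Hypothetical sub-patterns; the first alternative has two inner groups.
combined = re.compile(r"((\d+)-(\d+))|([a-z]+)")

m = combined.match("abc")
# Groups 1-3 belong to the first alternative, so the second alternative
# is group 4 rather than group 2.
print(m.groups())    # (None, None, None, 'abc')
print(m.lastindex)   # 4

# Non-capturing (?:...) groups inside the alternatives keep the outer
# numbering at one group per alternative.
flat = re.compile(r"((?:\d+)-(?:\d+))|([a-z]+)")
print(flat.match("abc").groups())    # (None, 'abc')
print(flat.match("12-34").groups())  # ('12-34', None)
```

Python's re tries alternatives left to right and takes the first one that succeeds, so the combined pattern keeps the "first match in order" behaviour of the loop.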


Since you have a finite set in this case, you could use short-circuit evaluation:

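# Parentheses are needed for the expression to span several lines.
# print() returns None, so m ends up None when nothing matches
# (print as a function needs Python 3 or from __future__ import print_function).
m = (compiled_regex_1.match(name) or
     compiled_regex_2.match(name) or
     compiled_regex_3.match(name) or
     print("ARGHHHH!"))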


In Python 2.6 or later (itertools.ifilter is Python 2 only; in Python 3 the built-in filter is already lazy):

import itertools as it

m = next(it.ifilter(None, (r.match(name) for r in regexes)), None)

The ifilter call could be made into a genexp, but only a bit awkwardly, i.e., with the usual trick for name binding in a genexp (aka the "phantom nested for clause idiom"):

m = next((m for r in regexes for m in (r.match(name),) if m), None)

but itertools is generally preferable where applicable.

The bit needing 2.6 is the next built-in, which lets you specify a default value if the iterator is exhausted. If you have to simulate it in 2.5 or earlier:

def next(itr, deft):
  try: return itr.next()
  except StopIteration: return deft
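
For Python 3, the same idea could look roughly like this. The built-in filter replaces itertools.ifilter. The patterns and the name first_match are placeholders, not part of the original answer.

```python
import re

# Hypothetical patterns, compiled once up front.
regexes = [re.compile(p) for p in (r"a+", r"b+", r"c+")]

def first_match(name):
    # filter(None, ...) lazily drops falsy results (a failed match is None),
    # and next(..., None) stops at the first survivor or returns the default.
    return next(filter(None, (r.match(name) for r in regexes)), None)

m = first_match("bbb")
print(m.group(0) if m else None)   # bbb
print(first_match("zzz"))          # None
```

Because both the generator expression and filter are lazy, regexes after the first successful one are never tried.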


I use something like what Dave Kirby suggested, but add named groups to the regexps, so that I know which one matched.

regexps = {
  'first': r'...',
  'second': r'...',
}

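# Note: a plain dict only guarantees insertion order from Python 3.7 on,
# so use a list of pairs if the order of alternatives matters on older versions.
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.items()))
match = compiled.match(my_string)
print(match.lastgroup)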
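
A runnable Python 3 sketch of this approach, with made-up pattern names and a hypothetical helper called classify, might look like this:

```python
import re

# Hypothetical patterns; dicts keep insertion order in Python 3.7+,
# so the alternatives are tried in the order written here.
regexps = {
    "iso_date": r"\d{4}-\d{2}-\d{2}",
    "number": r"\d+",
    "word": r"[A-Za-z]+",
}

compiled = re.compile("|".join("(?P<%s>%s)" % item for item in regexps.items()))

def classify(s):
    """Return (pattern_name, matched_text), or None if nothing matches."""
    match = compiled.match(s)
    if match is None:
        return None
    return match.lastgroup, match.group(match.lastgroup)

print(classify("2023-01-14"))  # ('iso_date', '2023-01-14')
print(classify("2023"))        # ('number', '2023')
print(classify("!?"))          # None
```

Checking for None before reading lastgroup matters, since compiled.match returns None when no alternative matches.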


Eric is on a better track in seeing the bigger picture of what the OP is aiming for, though I would use if/else. I also think that using the print function in an or expression is a little questionable. +1 to Nathon for correcting the OP to use a proper else clause.

Then my alternative:

# Alternative to the any built-in that returns a useful result:
# the first item that evaluates as true (None if there is none)
def first(seq):
    for item in seq:
        if item: return item

regexes = [
    compiled_regex_1,
    compiled_regex_2,
    compiled_regex_3,
]

m = first(reg.match(name) for reg in regexes)
print(m if m else 'ARGL NOTHING MATCHES THIS!!!')
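
If you also need to know which regex matched, a variant of the helper can return the position along with the match. This is only a sketch in Python 3; first_with_index and the patterns are hypothetical names for the example.

```python
import re

# Hypothetical patterns, compiled once up front.
regexes = [re.compile(p) for p in (r"a+", r"b+", r"c+")]

def first_with_index(name):
    """Return (index, match) for the first regex that matches, else None."""
    for i, reg in enumerate(regexes):
        m = reg.match(name)
        if m:
            return i, m
    return None

hit = first_with_index("ccc")
print(hit[0] if hit else "ARGL NOTHING MATCHES THIS!!!")   # 2
```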