开发者

Python: defining a union of regular expressions

开发者 https://www.devze.com 2023-01-07 23:40 出处:网络
I have a list of patterns like list_patterns = [\': error:\', \': warning:\', \'cc1plus:\', \'undefine reference to\']

I have a list of patterns like

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']

what I want to do is to produce a union of all of them yielding a regular expression that matches every element in list_patterns [but presumably does not match an开发者_运维技巧y re not in list_patterns -- msw]

re.compile(list_patterns)

Is this possible?


There are a couple of ways of doing this. The simplest is:

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
string = 'there is an : error: and a cc1plus: in this string'
print re.findall('|'.join(list_patterns), string)

Output:

[': error:', 'cc1plus:']

which is fine as long as concatenating your search patterns doesn't break the regex (eg if one of them contains a regex special character like an opening parenthesis). You can handle that this way:

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
string = 'there is an : error: and a cc1plus: in this string'
pattern = "|".join(re.escape(p) for p in list_patterns)
print re.findall(pattern, string)

Output is the same. But what this does is pass each pattern through re.escape() to escape any regex special characters.

Now which one you use depends on your list of patterns. Are they regular expressions and can thus be assumed to be valid? If so, the first would probably be appropriate. If they are strings, use the second method.

For the first, it gets more complicated however because by concatenating several regular expressions you may change the grouping and have other unintended side effects.


list_regexs = [re.compile(x) for x in list_patterns]


How about

ptrn = re.compile('|'.join(re.escape(e) for e in list_patterns))

Note the use of re.escape() to avoid unintended consequences by presence of characters like ()[]|.+* etc in some of the strings. Assuming you want that, otherwise skip the escape().

It also depends how do you intend to 'consume' that expression - is it only for search of a match or would you like to collect the matching groups back?


You want a pattern that matches any item in the list? Wouldn't that just be:

': error:|: warning:|cc1plus:|undefine reference to'?

Or, in Python code:

re.compile("|".join(list_patterns))


Cletus gave a very good answer. If however one of the strings to match could be a substring of another, then you would to reverse sort the strings first so that shortest matches do not obscure longer ones.

If, as Alex has noted, the original poster wanted what he actually asked for, then a more tractable solution than using permutations might be to:

  • Remove any duplicates in list_patterns. (It could be better off starting with a set then turning it into a reverse-sorted list without duplicates).
  • re.escape() the items of the list.
  • Surround each item in individually a group (... ).
  • '|'.join() all the groups.
  • Find the set of the indices of all groups that matched, and compare its length with len(list_patterns).

If there is at least one match for every entry in your original list of strings, then the length of the set should match.

The code would be something like:

import re

def usedgroupindex(indexabledata):
    for i,datum in enumerate(indexabledata):
        if datum: return i
    # return None

def findallstrings(list_patterns, string):
    lp = sorted(set(list_patterns), reverse=True)
    pattern = "|".join("(%s)" % re.escape(p) for p in lp)
    # for m in re.findall(pattern, string): print (m, usedgroupindex(m))
    return ( len(set(usedgroupindex(m) for m in re.findall(pattern, string)))
             == len(lp) )

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
string = ' XZX '.join(list_patterns)

print ( findallstrings(list_patterns, string) )


a regular expression that matches every element in the list

I see you've already got several answers based on the assumption that by "matches every element in the list" you actually meant "matches any element in the list" (the answers in questions are based on the | "or" operator of regular expressions).

If you actually do want a RE to match every element of the list (as opposed to any single such element), then you might want to match them either in the same order as the list gives them (easy), or, in any order whatever (hard).

For in-order matching, '.*?'.join(list_patterns) should serve you well (if the items are indeed to be taken as RE patterns - if they're to be taken as literal strings instead, '.*?'.join(re.escape(p) for p list_patterns)).

For any-order matching, regular expressions, per se, offer no direct support. You could take all permutations of the list (e.g. with itertools.permutations), join up each of them with '.*?', and join the whole with | -- but that can produce a terribly long RE pattern as a result, because the number of permutations of N items is N! ("N factorial" -- for example, for N equal 4, the permutations are 4 * 3 * 2 * 1 == 24). Performance may therefore easily suffer unless the number of items in the list is known to be very, very small.

For a more general solution to the "match every item in arbitrary order" problem (if that's what you need), one with a performance and memory footprint that's still acceptable for decently large lengths of list, you need to give up on the target of making it all work with a single RE object, and inject some logic in the mix -- for example, make a list of RE objects with relist=[re.compile(p) for p in list_patterns], and check "they all match string s, in any order" with all(r.search(s) for r in relist) or the like.

Of course, if you need to make this latest approach work in a "duck-typing compatible way" with actual RE objects, that's not hard, e.g., if all you need is a search method which returns a boolean result (since returning a "match object" would make no sense)...:

class relike(object):
    def __init__(self, list_patterns):
        self.relist = [re.compile(p) for p in list_patterns]
    def search(self, s):
        return all(r.search(s) for r in relist)
0

精彩评论

暂无评论...
验证码 换一张
取 消