I'm trying to use a regex as an input, and from there generate all the possible values that the regex would match.
So, for example, if the regex is "three-letter words starting with a, and ending in c," then the code would generate a 开发者_JAVA技巧list with the values [aac, abc, acc, adc, a1c....].
Is there an easy way to do this? I'm using python.
Here's a brute force solution that should work. It has a running time of O(L^max_length) (where L is the size of the alphabet), so use it at your own risk.
def all_matching_strings(alphabet, max_length, regex):
"""Find the list of all strings over 'alphabet' of length up to 'max_length' that match 'regex'"""
if max_length == 0: return
L = len(alphabet)
for N in range(1, max_length+1):
indices = [0]*N
for z in xrange(L**N):
r = ''.join(alphabet[i] for i in indices)
if regex.match(r):
yield(r)
i = 0
indices[i] += 1
while (i<N) and (indices[i]==L):
indices[i] = 0
i += 1
if i<N: indices[i] += 1
return
example usage:
alphabet = 'abcdef1234567890'
import re
regex = re.compile('f*[1-3]+$')
for r in all_matching_strings(alphabet, 5, regex):
print r
which would output all strings up to length 5, starting with a sequence of f's, and then a non empty sequence of 1-3, then ending:
1
2
3
f1
11
21
31
f2
12
22
32
f3
13
23
33
ff1
[more output omitted...]
You don't want to do this. Most of the result sets will be huge, and some will be infinite. Instead use a sequence of test vectors and apply the regex against each in turn:
vectors = (
'foo',
'bar',
...
)
for result in (re.match(someregex, entry) for entry in vectors):
...
The set of matching strings is infinite if and only if there is a quantifier (+ or *) in your regexp. Your question doesn't seem to aim at those patterns. I rather believe that the product function from itertools
might help here.
You might for instance introduce a special character indicating an arbitrary letter (e.g. an underscore), then build a pattern like this
patt = 'a_c'
and define your alphabet
youralphabet = 'abcde...'
and define a function generating all possible instances like this
def genInstances(patt):
elems = [c if c != '_' else youralphabet for c in patt]
return itertools.product(*elems)
You may then extend this approach to match real regexp by parsing your pattern for \d
or [a-zA-Z]
or whatever.
Some regular expressions match a finite number of input strings, but many (most?) match an infinite number of input strings. It's kind of like asking 'given the python language grammar, generate all possible python programs'. You probably could write a program to list them all in sequence if you tried (though it would take infinite time to run), but are you sure you want to? Why would you want to?
I'm pretty sure the regular expression engine in the standard library does not expose a way to generate the output you want. You'd have to get lower level access to the internal data structures, or implement some DFA engine thing-a-ma-bob yourself.
精彩评论