In a translation-testing app (in Python) I want a regular expression that will accept either of these two strings:
a = "I want the red book"
b = "the book which I want is red"
So far I'm using something like this:
^(the book which )*I want (is |the )red (book)*$
This will accept both string a and string b. But it will also accept a string without either of the two optional sub-strings:
sub1 = (the book which )
sub2 = (book)
How can I indicate that one of these two substrings must be present, even though they're not adjacent?
I realize that in this example it would be trivially easy to avoid the problem by just testing for longer alternatives separated by "or" (|). This is a simplified example of a problem that is harder to avoid with the actual user input I'm working with.
How can I indicate that one of these two substrings must be present, even though they're not adjacent?
I am assuming that is the core question you have.
The solution is two regexes. Why people feel that, once they say import re, the regex has to be a single line is beyond me.
First test for the first substring in one regex, then test for the other substring with another regex. Logically combine those two results.
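A minimal sketch of that idea, assuming the two patterns are the question's pattern with one of the optional substrings made mandatory in each. The names REQUIRE_SUB1, REQUIRE_SUB2 and accepts are made up for illustration:

```python
import re

# Same shape as the pattern in the question, but "the book which " is mandatory here.
REQUIRE_SUB1 = re.compile(r"^(the book which )+I want (is |the )red( book)*$")
# Same shape again, but here the trailing " book" is mandatory.
REQUIRE_SUB2 = re.compile(r"^(the book which )*I want (is |the )red( book)+$")


def accepts(sentence):
    """Return True if at least one of the two patterns matches the sentence."""
    return bool(REQUIRE_SUB1.search(sentence)) or bool(REQUIRE_SUB2.search(sentence))


for s in ("I want the red book",
          "the book which I want is red",
          "I want the red"):
    print(s, "->", accepts(s))
```

Each pattern stays readable on its own, and the "at least one" requirement lives in ordinary Python logic rather than inside the regex.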
This looks like a problem that might be better solved with a difflib.SequenceMatcher than with regular expressions.
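As a rough illustration only: the answer does not say how SequenceMatcher would be applied, so the word-level comparison and the names ACCEPTED and best_ratio below are assumptions. Any cut-off for "close enough" would have to be tuned against real user input.

```python
from difflib import SequenceMatcher

# Reference sentences taken from the question.
ACCEPTED = ("I want the red book", "the book which I want is red")


def best_ratio(candidate, accepted=ACCEPTED):
    """Return the highest word-level similarity to any accepted sentence (0.0 to 1.0)."""
    words = candidate.split()
    return max(SequenceMatcher(None, words, ref.split()).ratio()
               for ref in accepted)


for s in ("I want the red book", "I want the red", "give me a pencil"):
    print(s, "->", round(best_ratio(s), 3))
```

Note that a similarity score alone does not enforce "one of the two substrings must be present"; a sentence that merely drops " book" still scores fairly high.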
However, a regular expression that works for the specific example in the original question is as follows:
^(the book which )*I want (is |the )red((?(1)(?: book)*| book))$
This will not match the string "I want the red" (which lacks both of the required substrings "the book which " and " book"). It uses the (?(id/name)yes-pattern|no-pattern) syntax, which selects between two alternatives depending on whether a previously defined group took part in the match.
import re

# regx1: once "the book which " has matched, no trailing " book" is allowed
regx1 = re.compile(r'^(the book which )*I want (is |the )red'
                   r'((?(1)|(?: book)))$')
# regx2: once "the book which " has matched, a trailing " book" is optional
regx2 = re.compile(r'^(the book which )*I want (is |the )red'
                   r'((?(1)(?: book)*|(?: book)))$')

for x in ("I want the red book",
          "the book which I want is red",
          "I want the red",
          "the book which I want is red book"):
    print(x)
    for regx in (regx1, regx2):
        m = regx.search(x)
        print(m.groups() if m else 'No match')
    print()
Result:

I want the red book
(None, 'the ', ' book')
(None, 'the ', ' book')

the book which I want is red
('the book which ', 'is ', '')
('the book which ', 'is ', '')

I want the red
No match
No match

the book which I want is red book
No match
('the book which ', 'is ', ' book')
edit
Your regex pattern
^(the book which )*I want (is |the )red (book)*$
does not match every sentence correctly because of the space after "red": a string that ends in "red" with no trailing space, such as string b, is rejected.
It should be
'^(the book which )*I want (is |the )red( book)*$'
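A quick check of the difference, as a sketch. The variable names original and corrected are only labels for the two patterns quoted above:

```python
import re

# Pattern from the question: the space belongs to "red ", so it is always required.
original = re.compile(r"^(the book which )*I want (is |the )red (book)*$")
# Corrected pattern: the space moves inside the optional group.
corrected = re.compile(r"^(the book which )*I want (is |the )red( book)*$")

for s in ("I want the red book",
          "the book which I want is red",
          "I want the red"):
    print(repr(s), bool(original.search(s)), bool(corrected.search(s)))
```

The corrected pattern accepts both target strings, but it also accepts "I want the red", which is the behaviour the question wants to rule out.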