I'm just wondering if it's possible to use one regular expression to match another, that is some sort of:
['a-z'].match(['b-x'])
True
['m-n'].match(['0-9'])
False
Is this sort of thing possible with regex at all? I'm doing work in python, so any advice specific to the re
module's implementation would help, but I'll take anything I can get concerning regex.
Edit: Ok, some clarification is obviously in order! I definitely know that normal matching syntax would look something like this:
expr = re.compile(r'[a-z]*')
string = "some words"
expr.match(string)
<sRE object blah blah>
but I'm wondering if regular expressions have the capabilit开发者_如何学编程y to match other, less specific expressions in the non-syntacticly correct version I tried to explain with above, any letter from b-x would always be a subset (match) of any letter from a-z. I know just from trying that this isn't something you can do by just calling the match of one compiled expression on another compiled expression, but the question remains: is this at all possible?
Let me know if this still isn't clear.
I think — in theory — to tell whether regexp A
matches a subset of what regexp B
matches, an algorithm could:
- Compute the minimal Deterministic Finite Automaton of
B
and also of the "union"A|B
. - Check if the two DFAs are identical. This is true if and only if A matches a subset of what B matches.
However, it would likely be a major project to do this in practice. There are explanations such as Constructing a minimum-state DFA from a Regular Expression but they only tend to consider mathematically pure regexps. You would also have to handle the extensions that Python adds for convenience. Moreover, if any of the extensions cause the language to be non-regular (I am not sure if this is the case) you might not be able to handle those ones.
But what are you trying to do? Perhaps there's an easier approach...?
Verification of the post by "antinome" using two regex : 55* and 5* :
REGEX_A: 55* [This matches "5", "55", "555" etc. and does NOT match "4" , "54" etc]
REGEX_B: 5* [This matches "", "5" "55", "555" etc. and does NOT match "4" , "54" etc]
[Here we've assumed that 55* is not implicitly .55.* and 5* is not .5.* - This is why 5* does not match 4]
REGEX_A can have an NFA as below:
{A}--5-->{B}--epsilon-->{C}--5-->{D}--epsilon-->{E}
{B} -----------------epsilon --------> {E}
{C} <--- epsilon ------ {E}
REGEX_B can have an NFA as below:
{A}--epsilon-->{B}--5-->{C}--epsilon-->{D}
{A} --------------epsilon -----------> {D}
{B} <--- epsilon ------ {D}
Now we can derive NFA * DFA of (REGEX_A|REGEX_B) as below:
NFA:
{state A} ---epsilon --> {state B} ---5--> {state C} ---5--> {state D}
{state C} ---epsilon --> {state D}
{state C} <---epsilon -- {state D}
{state A} ---epsilon --> {state E} ---5--> {state F}
{state E} ---epsilon --> {state F}
{state E} <---epsilon -- {state F}
NFA -> DFA:
| 5 | epsilon*
----+--------------+--------
A | B,C,E,F,G | A,C,E,F
B | C,D,E,F | B,C,E,F
c | C,D,E,F | C
D | C,D,E,F,G | C,D,E,F
E | C,D,E,F,G | C,E,F
F | C,E,F,G | F
G | C,D,E,G | C,E,F,G
5(epsilon*)
-------------+---------------------
A | B,C,E,F,G
B,C,E,F,G | C,D,E,F,G
C,D,E,F,G | C,D,E,F,G
Finally the DFA for (REGEX_A|REGEX_B) is:
{A}--5--->{B,C,E,F,G}--5--->{C,D,E,F,G}
{C,D,E,F,G}---5--> {C,D,E,F,G}
Note: {A} is start state and {C,D,E,F,G} is accepting state.
Similarly DFA for REGEX_A (55*) is:
| 5 | epsilon*
----+--------+--------
A | B,C,E | A
B | C,D,E | B,C,E
C | C,D,E | C
D | C,D,E | C,D,E
E | C,D,E | C,E
5(epsilon*)
-------+---------------------
A | B,C,E
B,C,E | C,D,E
C,D,E | C,D,E
{A} ---- 5 -----> {B,C,E}--5--->{C,D,E}
{C,D,E}--5--->{C,D,E}
Note: {A} is start state and {C,D,E} is accepting state
Similarly DFA for REGEX_B (5*) is:
| 5 | epsilon*
----+--------+--------
A | B,C,D | A,B,D
B | B,C,D | B
C | B,C,D | B,C,D
D | B,C,D | B,D
5(epsilon*)
-------+---------------------
A | B,C,D
B,C,D | B,C,D
{A} ---- 5 -----> {B,C,D}
{B,C,D} --- 5 ---> {B,C,D}
Note: {A} is start state and {B,C,D} is accepting state
Conclusions:
DFA of REGX_A|REGX_B identical to DFA of REGX_A
-- implies REGEX_A is subset of REGEX_B
DFA of REGX_A|REGX_B is NOT identical to DFA of REGX_B
-- cannot infer about either gerexes.
In addition to antinome's answer:
Many of the constructs that are not part of the basic regex definition are still regular, and can be converted after parsing the regex (with a real parser, because the language of regex is not regular itself): (x?)
to (x|)
, (x+)
to (xx*)
, character classes like [a-d]
to their corresponding union (a|b|c|d)
etc. So one can use these constructs and still test whether one regex matches a subset of the other regex using the DFA comparison mentioned by antinome.
Some constructs, like back references, are not regular, and cannot be represented by NFA or DFA.
Even the seemingly simple problem of testing whether a regex with back references matches a particular string is NP-complete (http://perl.plover.com/NPC/NPC-3COL.html).
pip3 install https://github.com/leafstorm/lexington/archive/master.zip
python3
>>> from lexington.regex import Regex as R
>>> from lexington.regex import Null
>>> from functools import reduce
>>> from string import ascii_lowercase, digits
>>> a_z = reduce(lambda a, b: a | R(b), ascii_lowercase, Null)
>>> b_x = reduce(lambda a, b: a | R(b), ascii_lowercase[1:-2], Null)
>>> a_z | b_x == a_z
True
>>> m_n = R("m") | R("n")
>>> zero_nine = reduce(lambda a, b: a | R(b), digits, Null)
>>> m_n | zero_nine == m_n
False
Also tested successfully with Python 2. See also how to do it with a different library.
Alternatively, pip3 install https://github.com/ferno/greenery/archive/master.zip
and:
from greenery.lego import parse as p
a_z = p("[a-z]")
b_x = p("[b-x]")
assert a_z | b_x == a_z
m_n = p("m|n")
zero_nine = p("[0-9]")
assert not m_n | zero_nine == m_n
You should do something along these lines:
re.match("\[[b-x]-[b-x]\]", "[a-z]")
The regular expression has to define what the string should look like. If you want to match an opening square bracket followed by a letter from b to x, then a dash, then another letter from b to x and finally a closing square bracket, the solution above should work.
If you intend to validate that a regular expression is correct you should consider testing if it compiles instead.
It's possible with the string representation of a regex, since any string can be matched with regexes, but not with the compiled version returned by re.compile
. I don't see what use this would be, though. Also, it takes a different syntax.
Edit: you seem to be looking for the ability to detect whether the language defined by an RE is a subset of another RE's. Yes, I think that's possible, but no, Python's re
module doesn't do it.
Some clarification is required, I think:
.
rA = re.compile('(?<! )[a-z52]+')
'(?<! )[a-z52]+'
is a pattern
rA
is an instance of class RegexObject whose type is < *type '_sre.SRE_Pattern' >* .
Personally I use the term regex exclusively for this kind of objects, not for the pattern .
Note that rB = re.compile(rA)
is also possible, it produces the same object (id(rA) == id(rB) equals to True)
ch = 'lshdgbfcs luyt52uir bqisuytfqr454'
x = rA.match(ch)
# or
y = rA.search(ch)
x and y are instances of the class MatchObject whose type is *< type '_sre.SRE_Match' >*
.
That said, you want to know if there a way to determine if a regex rA can match all the strings matched by another regex rB while rB matches only a subsest of all the strings matched by rA.
I don't think such a way exists, whatever theoretically or practically.
精彩评论