开发者

Intersection of two regular expressions

开发者 https://www.devze.com 2023-01-01 22:07 出处:网络
Im looking for function (PHP will be the best), which returns true whether exists string matches both regexpA and regexpB.

Im looking for function (PHP will be the best), which returns true whether exists string matches both regexpA and regexpB.

Example 1:

$regexpA = '[0-9]+';
$regexpB = '[0-9]{2,3}';

hasRegularsIntersection($regexpA,$regexpB) returns TRUE because '12' matches both regexps

Example 2:

$regexpA = '[0-9]+';
$regexpB = '[a-z]+';

hasRegularsIntersection($regexpA,$regexpB) returns FALSE because numbe开发者_如何学Gors never matches literals.

Thanks for any suggestions how to solve this.

Henry


For regular expressions that are actually regular (i.e. don't use irregular features like back references) you can do the following:

  1. Transform the regexen into finite automata (the algorithm for that can be found here(chapter 9) for example).
  2. Build the intersection of the automata (You have a state for each state in the cartesian product of the states of the two automata. You then transition between the states according to the original automata's transition rules. E.g. if you're in state x1y2, you get the input a, the first automaton has a transition x1->x4 for input x and the second automaton has y2->y3, you transition into the state x4y3).
  3. Check whether there's a path from the start state to the end state in the new automaton. If there is, the two regexen intersect, otherwise they don't.


Theory.

Java library.

Usage:

/**
 * @return true if the two regexes will never both match a given string
 */
public boolean isRegexOrthogonal( String regex1, String regex2 ) {
   Automaton automaton1 = new RegExp(regex1).toAutomaton();
   Automaton automaton2 = new RegExp(regex2).toAutomaton();
   return automaton1.intersection(automaton2).isEmpty();
}


A regular expression specifies a finite state machine that can recognize a potentially infinite set of strings. The set of strings may be infinite but the number of states must be finite, so you can examine the states one by one.

Taking your second example: In the first expression, to get from state 0 to state 1, the string must start with a digit. In the second expression, to get from state 0 to state 1, the string must start with a letter. So you know that there is no string that will get you from state 0 to state 1 in BOTH expressions. You have the answer.

Taking the first example: You know that if the string starts with a digit you can get from state 0 to state 1 with either regular expression. So now you can eliminate state 0 for each, and just answer the question for each of the two (now one state smaller) finite-state-machines.

This looks a lot like the well-known "halting problem", which as you know is unsolvable in the general case for a Turing machine or equivalent. But in fact the halting problem IS solvable for a finite-state machine, simply because there are a finite number of states.

I believe you could solve this with a non-deterministic FSM. If your regex had only one transition from each state to the next, a deterministic FSM could solve it. But a regular expression means that for instance if you are in state 2, then if the caracter is a digit you go to state 3, else if the character is a letter you go to state 4.

So here's what I would do:

  1. Solve it for the subset of FSM's that have only one transition from one state to the next. For instance a regex that matches both "Bob" and "bob", and a second regex that matches only "bob" and "boB".

  2. See if you can implement the solution in a finite state machine. I think this should be possible. The input to a state is a pair representing a transition for one FSM and a transition for the second one. For instance: State 0: If (r1, r2) is (("B" or "b"), "b") then State 1. State 1: If (r1, r2) is (("o"), ("o")) then state 2. etc.

  3. Now for the more general case, where for instance state two goes back to state two or an earlier state; for example, regex 1 recognizes only "meet" but regex 2 recognizes "meeeet" with an unlimited number of e's. You would have to reduce them to regex 1 recognizing "t" and regex 2 recognizing "t". I think this may be solvable by a non-deterministic FSM.

That's my intuition anyway.

I don't think it's NP-complete, just because my intuition tells me you should be able to shorten each regex by one state with each step.


It is possible. I encountered it once with Pellet OWL reasoner while learning semantic web technologies.

Here is an example that shows how regular expressions can be parsed into a tree structure. You could then (in theory) parse your two regular expressions to trees and see if one tree is a subset of the other tree, ie. if one tree can be found in within other tree's nodes.

If it is found, then the other regular expression will match (not only, but also) a subset of what the first regular expression will match.

It is not much of a solution, but maybe it'll help you.

0

精彩评论

暂无评论...
验证码 换一张
取 消