Building a Regex Composer_问答_开发者_运维开发者技术经验分享

I was reading the Java project idea described here:

The user gives examples of what he wants and does not want to match. The program tries to deduce a regex that fits the examples. Then it generates examples that would fit and not fit. The user corrects the examples it got wrong, and it composes a new regex. Iteratively, you get a regex that is close enough to what you need.

This sounds like a really interesting i开发者_如何转开发dea to me. Does anyone has an idea as to how to do this? My first idea was something like a genetic algorithm, but I would love some input from you guys.

Actually, this starts to look more and more like a compiler application. In fact, if I remember correctly, the Aho Dragon compiler book uses a regex example to build a DFA compiler. That's the place to start. This could be a really cool compiler project.

If that is too much, you can approach it as an optimization in several passes to refine it further and further, but it will be all predefined algo's at first:

First pass: Want to match Cat, Catches cans Result: /Cat|Catches|Cans/

Second Pass: Look for similar starting conditions: Result: /Ca(t|tches|ans)/

Second Pass: Look for similar ending conditions: Result: /Ca(t|tch|an)s*/

Third Pass: Look for more refinements like repetitions and negative conditions

There exists algorithm that does exactly this for positive examples.

Regular expression are equivalent to DFA (Deterministic Finite Automata). The strategie is mostly always the same:

Look at Alergia (for the theory) and MDI algorithm (for real usage) if generate an Deterministic Automaton is enough.

The Alergia algorithm and MDI are both described here: http://www.info.ucl.ac.be/~pdupont/pdupont/pdf/icml2k.pdf

If you want to generate smaller models you can use another algorithm. The article describing it is here: http://www.grappa.univ-lille3.fr/~lemay/publi/TCS02.ps.gz

His homepage is here: http://www.grappa.univ-lille3.fr/~lemay

If you want to use negative example, I suggest you to use a simple rule (coloring) that prevent two states of the DFA to be merged.

If you ask these people, I am sure they will share their code source.

I made the same kind of algorithm during my Ph.D. for probabilistic automata. That means, you can associate a probability to each string, and I have made a C++ program that "learn" Weighted Automata.

Mainly these algorithm work like that:

from positive examples: {abb, aba, abbb}

create the simplest DFA that accept exactly all these examples:

-> x -- a --> (y) -- b --> (z)
          \
            b --> t -- b --> (v)

x cant got to state y by reading the letter 'a' for example. The states are x, y, z, t and v. (z) means it is a finite state.

then "merge" states of the DFA: (here for example the result after merging states y and t.

               _
              /  \
             v    | a,b     ( <- this is a loop :-) )
x -- a -> (y,t) _/

the new state (y,t) is a terminal state obtaining by merging y and t. And you can read the letter a and b from it. Now the DFA can accept: a(a|b)* and it is easy to construct the regular expression from the DFA.

Which states to merge is a choice that makes the main difference between algorithms.

The program tries to deduce a regex that fits the examples

I don't think it's a useful question to ask. You have to know semantically what you need to represent to deduce something. When you write a regex, you have a purpose: accepting urls, accepting emails, extracting tokens from code, etc. I would redefine the question as so: given a knowledge base and a semantic for regular expression, compute the smallest regex. This get a step further, because you have natural language trying explaining a general expression and we all know how it get ambiguous! You have to have some semantic explanation. Without that, the best thing you can do for examples is to compute regex that cover all string from the ok list.

Algorithm for coverage:

Populate Ok List
Populate Not ok List
Check for repetitions
Check for contradictions ( the same string cannot be in both list )
Create Deterministic Finite Automata (DFA) from Ok List where strings from the list are final states
Simplify the DFA by eliminating repetitive states. ([1] 4.4 )
Convert DFA to regular expression. ([1] 3.2.2 )
Test Ok list and Not ok list

[1] Introduction to Automata Theory, Language, and Computation. J. Hopcroft, R. Motwani, J.D. Ullman, 2nd Edition, Pearson Education.

P.S.

I had some experience with genetic programming and I think it's really overhead for your problem. Since the objective function is really light it's better to evaluate with a single processor and this can take a lot of time. To have shorter expression you just need to minimize the DFA. But GA may possibly produce interesting result.

Maybe I'm a bit late, but there is a way to solve this problem by means of Genetic Programming.

Genetic Programming (GP) is an evolutionary machine learning technique in which candidate a candidate solution for a given problem is represeted as an abstract syntax tree.

Several studies have been published on how to use GP in order to find a regular expression that matches a given set of examples. You can find the articles and the details here

A webapp that does this is hosted at regex.inginf.units.it. The source code behind the application has been publicly released on github

You may try to use a basic inferring algorithm that has been used in other applications. I have implemented a very basic based on building a state machine. However, it only accounts for positive samples. The source code is on http://github.com/mvaled/inferdtd

Should could be interested in the AutomataInferrer.py which is very simple.

RegexBuilder seems to have many of the features you're looking for.