How to get the AST of a regular expression string?_问答_开发者

How can I get the abstract syntax tree (AST) of a regular expression (in C++)?

For example,

 (XYZ)|(123)

should yield a tree of:

        |
      /   \
    .       .
   / \     开发者_StackOverflow/ \
  .   Z   .   3    
 / \     / \   
X  Y     1 2

Is there a boost::spirit grammar to parse regular expression patterns? The boost::regex library should have it, but I didn't find it. Are there any other open-source tools available that would give me the abstract representation of a regex?

I stumbled into this question again. And I decided to take a look at how hard it would actually be to write a parser for a significant subset of regular expression syntax with Boost Spirit.

So, as usual, I started out with pen and paper, and after a while had some draft rules in mind. Time to draw the analogous AST up:

namespace ast
{
    struct multiplicity 
    {
        unsigned minoccurs;
        boost::optional<unsigned> maxoccurs;
        bool greedy;

        multiplicity(unsigned minoccurs = 1, boost::optional<unsigned> maxoccurs = 1) 
            : minoccurs(minoccurs), maxoccurs(maxoccurs), greedy(true)
        { }

        bool unbounded() const { return !maxoccurs; }
        bool repeating() const { return !maxoccurs || *maxoccurs > 1; }
    };

    struct charset
    {
        bool negated;

        using range   = boost::tuple<char, char>; // from, till
        using element = boost::variant<char, range>;

        std::set<element> elements; 
        // TODO: single set for loose elements, simplify() method
    };

    struct start_of_match {};
    struct end_of_match {};
    struct any_char {};
    struct group;

    typedef boost::variant<   // unquantified expression
        start_of_match,
        end_of_match,
        any_char,
        charset,
        std::string,          // literal
        boost::recursive_wrapper<group> // sub expression
    > simple;

    struct atom               // quantified simple expression
    {
        simple       expr;
        multiplicity mult;
    };

    using sequence    = std::vector<atom>;
    using alternative = std::vector<sequence>;
    using regex       = boost::variant<atom, sequence, alternative>;

    struct group {
        alternative root;

        group() = default;
        group(alternative root) : root(std::move(root)) { }
    };
}

This is your typical AST (58 LoC) that works well with Spirit (due to integrating with boost via variant and optional, as well as having strategically chosen constructors).

The grammar ended up being only slightly longer:

template <typename It>
    struct parser : qi::grammar<It, ast::alternative()>
{
    parser() : parser::base_type(alternative)
    {
        using namespace qi;
        using phx::construct;
        using ast::multiplicity;

        alternative = sequence % '|';
        sequence    = *atom;

        simple      = 
                      (group)
                    | (charset)
                    | ('.' >> qi::attr(ast::any_char()))
                    | ('^' >> qi::attr(ast::start_of_match()))
                    | ('$' >> qi::attr(ast::end_of_match()))
                    // optimize literal tree nodes by grouping unquantified literal chars
                    | (as_string [ +(literal >> !char_("{?+*")) ])
                    | (as_string [ literal ]) // lone char/escape + explicit_quantifier
                    ;

        atom        = (simple >> quantifier); // quantifier may be implicit

        explicit_quantifier  =
                    // bounded ranges:
                      lit('?')                                   [ _val = construct<multiplicity>( 0, 1)   ]
                    | ('{'  >> uint_ >> '}' )                    [ _val = construct<multiplicity>(_1, _1)  ]
                    // repeating ranges can be marked non-greedy:
                    | (                                        
                          lit('+')                               [ _val = construct<multiplicity>( 1, boost::none) ]
                        | lit('*')                               [ _val = construct<multiplicity>( 0, boost::none) ]
                        | ('{'  >> uint_ >> ",}")                [ _val = construct<multiplicity>(_1, boost::none) ]
                        | ('{'  >> uint_ >> "," >> uint_ >> '}') [ _val = construct<multiplicity>(_1, _2)  ]
                        | ("{," >> uint_ >> '}' )                [ _val = construct<multiplicity>( 0, _1)  ]
                      ) >> -lit('?')       [ phx::bind(&multiplicity::greedy, _val) = false ]
                    ;

        quantifier = explicit_quantifier | attr(ast::multiplicity());

        charset     = '[' 
                   >> (lit('^') >> attr(true) | attr(false)) // negated
                   >> *(range | charset_el)
                    > ']'
                    ;

        range       = charset_el >> '-' >> charset_el;

        group       = '(' >> alternative >> ')';

        literal     = unescape | ~char_("\\+*?.^$|{()") ;

        unescape    = ('\\' > char_);

        // helper to optionally unescape waiting for raw ']'
        charset_el  = !lit(']') >> (unescape|char_);
    }

  private:
    qi::rule<It, ast::alternative()>    alternative;
    qi::rule<It, ast::sequence()>       sequence;
    qi::rule<It, ast::atom()>           atom;
    qi::rule<It, ast::simple()>         simple;
    qi::rule<It, ast::multiplicity()>   explicit_quantifier, quantifier;
    qi::rule<It, ast::charset()>        charset;
    qi::rule<It, ast::charset::range()> range;
    qi::rule<It, ast::group()>          group;
    qi::rule<It, char()>                literal, unescape, charset_el;
};

Now, the real fun is to do something with the AST. Since you want to visualize the tree, I thought of generating DOT graph from the AST. So I did:

int main()
{
    std::cout << "digraph common {\n";

    for (std::string pattern: { 
            "abc?",
            "ab+c",
            "(ab)+c",
            "[^-a\\-f-z\"\\]aaaa-]?",
            "abc|d",
            "a?",
            ".*?(a|b){,9}?",
            "(XYZ)|(123)",
        })
    {
        std::cout << "// ================= " << pattern << " ========\n";
        ast::regex tree;
        if (doParse(pattern, tree))
        {
            check_roundtrip(tree, pattern);

            regex_todigraph printer(std::cout, pattern);
            boost::apply_visitor(printer, tree);
        }
    }

    std::cout << "}\n";
}

This program results in the following graphs:

How to get the AST of a regular expression string?

The self-edges depict repeats and the colour indicates whether the match is greedy (red) or non-greedy (blue). As you can see I've optimized the AST a bit for clarity, but (un)commenting the relevant lines will make the difference:

How to get the AST of a regular expression string?

I think it wouldn't be too hard to tune. Hopefully it will serve as inspiration to someone.

Full code at this gist: https://gist.github.com/sehe/8678988

I reckon that Boost Xpressive must be able to 'almost' do this out of the box.

xpressive is an advanced, object-oriented regular expression template library for C++. Regular expressions can be written as strings that are parsed at run-time, or as expression templates that are parsed at compile-time. Regular expressions can refer to each other and to themselves recursively, allowing you to build arbitrarily complicated grammars out of them.

I'll see whether I can confirm (with a small sample).

Other thoughts include using Boost Spirit with the generic utree facility to 'store' the AST. You'd have to reproduce a grammar (which is relatively simple for common subsets of Regex syntax), so it might mean more work.

Progress Report 1

Looking at Xpressive, I made some inroads. I got some pretty pictures using DDD's great graphical data display. But not pretty enough.

Then I explored the 'code' side more: Xpressive is built upon Boost Proto. It uses Proto to define a DSEL that models regular expressions directly in C++ code. Proto generates the expression tree (generic AST, if you will) completely generically from C++ code (by overloading all possible operators). The library (Xpressive, in this case) then needs to define the semantics by walking the tree and e.g.

building a domain specific expression tree
annotating/decorating it with semantic information
possibly taking semantic action directly (e.g. how Boost Spirit does semantic actions in Qi and Karma¹)

As you can see, the sky is really the limit there, and things are looking disturbingly similar to compiler macros like in Boo, Nemerle, Lisp etc.

Visualizing Expression Trres

Now, Boost Proto expression trees can be generically visualized:

Working from the example from Expressive C++: Playing with Syntax I slightly extended Xpressive's "Hello World" example to display the expression tree:

#include <iostream>
#include <boost/xpressive/xpressive.hpp>
#include <boost/proto/proto.hpp>

using namespace boost::xpressive;

int main()
{
    std::string hello( "hello world!" );

    sregex rex = sregex::compile( "(\\w+) (\\w+)!" );

    // equivalent proto based expression
    rex = (s1= +_w) >> ' ' >> (s2= +_w) >> '!';
    boost::proto::display_expr( (s1= +_w) >> ' ' >> (s2= +_w) >> '!');

    smatch what;

    if( regex_match( hello, what, rex ) )
    {
        std::cout << what[0] << '\n'; // whole match
        std::cout << what[1] << '\n'; // first capture
        std::cout << what[2] << '\n'; // second capture
    }

    return 0;
}

The output of which is close to (note the compiler ABI specific typeid names):

shift_right(
    shift_right(
        shift_right(
            assign(
                terminal(N5boost9xpressive6detail16mark_placeholderE)
              , unary_plus(
                    terminal(N5boost9xpressive6detail25posix_charset_placeholderE)
                )
            )
          , terminal( )
        )
      , assign(
            terminal(N5boost9xpressive6detail16mark_placeholderE)
          , unary_plus(
                terminal(N5boost9xpressive6detail25posix_charset_placeholderE)
            )
        )
    )
  , terminal(!)
)
hello world!
hello
world

DISCLAIMER You should realize that this is not actually displaying the Regex AST, but rather the generic expression tree from Proto, so it is devoid of domain specific (Regex) information. I mention it because the difference is likely going to cause some more work (? unless I find a hook into Xpressive's compilation structures) for it to become truly useful for the original question.

That's it for now

I'll leave on that note, as it's lunch time and I'm picking up the kids, but this certainly grabbed my interest, so I intend to post more later!

Conclusions / Progress Report 1.0000001

The bad news right away: it won't work.

Here's why. That disclaimer was right on the money. When weekend arrived, I had already been thinking things through a bit more and 'predicted' that the whole thing would break down right where I left it off: the AST is being based on the proto expression tree (not the regex matchable_ex).

This fact was quickly confirmed after some code inspection: after compilation, the proto expression tree isn't available anymore to be displayed. Let alone when the basic_regex was specified as a dynamic pattern in the first place (there never was a proto expression for it).

I had been half hoping that the matching had been implemented directly on the proto expression tree (using proto evalutation/evaluation contexts), but quickly confirmed out that this is not the case.

So, the main takeaway is:

this is all not going to work for displaying any regex AST
the best you can do with the above is visualize a proto expression, that you'll necessarily have to create directly in your code. That is a fancy way of just writing the AST manually in that same code...

Slightly less strict observations include

Boost Proto and Boost Expressive are highly interesting libraries (I didn't mind going fishing in there). I have obviously learned a few important lessons about template meta-programming libraries, and these libraries in particular.
It is hard to design a regex parser that builds a statically typed expression tree. In fact it is impossible in the general case - it would require the compiler to instantiate all possible expressions tree combinations to a certain depth. This would obviously not scale. You could get around that by introducing polymorphic composition and using polymorphic invocation, but this would remove the benefits of template metaprogramming (compile-time optimization for the statically instantiated types/specializations).
Both Boost Regex and Boost Expressive will likely support some kind of regex AST internally (to support the matching evaluation) but
- it hasn't been exposed/documented
- there is no obvious display facility for that

_{¹ Even Spirit Lex supports them, for that matter (but not by default)}

boost::regex seems to have a hand-written recursive-descent parser in basic_regex_parser.hpp. Even though it feels awfully like re-inventing the wheel, you are probably faster when writing up the grammar in boost::spirit yourself, especially with the multitude of regex formats around.