Detect if a regexp is exponential_问答_开发者_运维开发者技术经验分享

开发者 https://www.devze.com 2023-01-10 04:07 出处：网络

This article show that there开发者_StackOverflow is some regexp that is O(2^n) when backtracking. The example is (x+x+)+y.

This article show that there开发者_StackOverflow is some regexp that is O(2^n) when backtracking. The example is (x+x+)+y. When attempt to match a string like xxxx...p it going to backtrack for a while before figure it out that it couldn't match.

Is there a way to detect such regexp?

thanks

If your regexp engine exposes runtime exponential behavior for (x+x+)+y ,then it is broken because a DFA or NFA can recognize this pattern in linear time:

echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | egrep "(x+x+)+y"
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxy" | egrep "(x+x+)+y"

both answer immediately.

In fact, there are only a few cases (like backreferences) where backtracking is really needed (mainly, because a regexp with a backreference is not a regular expression in the language theoretic sense anymore). A capable implementation should switch to backtracking only when these corner cases are given.

In fairness, DFA's have a dark side too, because some regexp's have exponential size requirements, but a size contraints is easier to enforce than a time constraint and the huge DFA runs linear on the input, so it's a better bargain than a small backtracker choking on a couple of X's.

You should really read Russ Cox excellent article series about the implementation of regexp (and the pathological behavior of backtracking): http://swtch.com/~rsc/regexp/

To answer your question about decidability: You can't. Because there is not the one backtracking for regexpr. Every implementation has its own strategies to deal with exponential growth in their algorithm for certain cases and does not cover others. One rule might be fit for here and catastrophic for there.

UPDATE:

For example, one implementation could contain an optimizer which could use algebraic transformations to simplify regexps before executing them: (x+x+)+y is the same a xxx*y, which shouldn't be a problem for any backtracker. But the same optimizer wouldn't recognize the next expression and the problem is there again. Here someone described how to craft a regexpr which fools Perl's optimizer:

http://perlgeek.de/blog-en/perl-tips/in-search-of-an-exponetial-regexp.html

No I don't think so, but you can use these guidelines:

If it contains two quantifiers that are open-ended at the high end and they are nested then it might be O(2^n).
If it does not contain two such quantifiers then I think it cannot be O(2^n).

Quantifiers that can cause this are: *, + and {k,}.

Also note that the worst case complexity of evaluating a regular expression might be very different from the complexity on typical strings and that the complexity depends on the specific regular expression engine.

Any regex without backreferences can be matched in linear time, though many regex engines out there in the real world don't do it that way (at least many regex engines that are plugged into programming language runtime environments support backreferences, and don't switch to a more efficient execution model when no backreferences are present).

There's no easy way to find out how much time a regex with backreferences is going to consume.

You could detect and reject nested repetitions using a regex parser, which corresponds to a star height of 1. I've just written a module to compute and reject start heights of >1 using a regex parser from npm.

$ node safe.js '(x+x+)+y'
false
$ node safe.js '(beep|boop)*'
true
$ node safe.js '(a+){10}'
false
$ node safe.js '\blocation\s*:[^:\n]+\b(Oakland|San Francisco)\b'
true