开发者

Is regex too slow? Real life examples where simple non-regex alternative is better

开发者 https://www.devze.com 2022-12-27 14:48 出处:网络
I\'ve seen people here made comments like \"regex is too slow!\", or \"why would you do something so simple using regex!\" (and then present a 10+ lines alternative instead), etc.

I've seen people here made comments like "regex is too slow!", or "why would you do something so simple using regex!" (and then present a 10+ lines alternative instead), etc.

I haven't really used regex in industrial setting, so I'm curious if there are applications where regex is demonstratably j开发者_如何学Pythonust too slow, AND where a simple non-regex alternative exists that performs significantly (maybe even asymptotically!) better.

Obviously many highly-specialized string manipulations with sophisticated string algorithms will outperform regex easily, but I'm talking about cases where a simple solution exists and significantly outperforms regex.

What counts as simple is subjective, of course, but I think a reasonable standard is that if it uses only String, StringBuilder, etc, then it's probably simple.


Note: I would very much appreciate answers that demonstrate the following:

  1. a beginner-level regex solution to a non-toy real-life problem that performs horribly
  2. the simple non-regex solution
  3. the expert-level regex rewrite that performs comparably


I remember a textbook example of a regex gone bad. Be aware that none of the following approaches is recommended for production use! Use a proper CSV parser instead.

The mistake made in this example is quite common: Using a dot where a narrower character class is better suited.

In a CSV file containing on each line exactly 12 integers separated by commas, find the lines that have a 13 in the 6th position (no matter where else a 13 may be).

1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12 // don't match
42,12,13,12,32,13,14,43,56,31,78,10 // match
42,12,13,12,32,14,13,43,56,31,78,10 // don't match

We use a regex containing exactly 11 commas:

".*,.*,.*,.*,.*,13,.*,.*,.*,.*,.*,.*"

This way, each ".*" is confined to a single number. This regex solves the task, but has very bad performance. (Roughly 600 microseconds per string on my computer, with little difference between matched and unmatched strings.)

A simple non-regex solution would be to split() each line and compare the 6th element. (Much faster: 9 microseconds per string.)

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ".*" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

So we replace the greedy quantifier with the reluctant one:

".*?,.*?,.*?,.*?,.*?,13,.*?,.*?,.*?,.*?,.*?,.*?"

This performs way better for a matched string (by a factor of 100), but has almost unchanged performance for a non-matched string.

A performant regex replaces the dot by the character class "[^,]":

"[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*"

(This needs 3.7 microseconds per string for the matched string and 2.4 for the unmatched strings on my computer.)


I experimented a bit with the performance of various constructs, and unfortunately I discovered that Java regex doesn't perform what I consider very doable optimizations.

Java regex takes O(N) to match "(?s)^.*+$"

This is very disappointing. It's understandable for ".*" to take O(N), but with the optimization "hints" in the form of anchors (^ and $) and single-line mode Pattern.DOTALL/(?s), even making the repetition possessive (i.e. no backtracking), the regex engine still could not see that this will match every string, and still have to match in O(N).

This pattern isn't very useful, of course, but consider the next problem.

Java regex takes O(N) to match "(?s)^A.*Z$"

Again, I was hoping that the regex engine can see that thanks to the anchors and single-line mode, this is essentially the same as the O(1) non-regex:

 s.startsWith("A") && s.endsWith("Z")

Unfortunately, no, this is still O(N). Very disappointing. Still, not very convincing because a nice and simple non-regex alternative exists.

Java regex takes O(N) to match "(?s)^.*[aeiou]{3}$"

This pattern matches strings that ends with 3 lowercase vowels. There is no nice and simple non-regex alternative, but you can still write something non-regex that matches this in O(1), since you only need to check the last 3 characters (for simplicity, we can assume that the string length is at least 3).

I also tried "(?s)^.*$(?<=[aeiou]{3})", in an attempt to tell the regex engine to just ignore everything else, and just check the last 3 characters, but of course this is still O(N) (which follows from the first section above).

In this particular scenario, however, regex can be made useful by combining it with substring. That is, instead of seeing if the whole string matches the pattern, you can manually restrict the pattern to attempt to match only the last 3 characters substring. In general, if you know before hand that the pattern has a finite length maximum match, you can substring the necessary amount of characters from the end of a very long string and regex on just that part.


Test harness

static void testAnchors() {
    String pattern = "(?s)^.*[aeiou]{3}$";
    for (int N = 1; N < 20; N++) {
        String needle = stringLength(1 << N) + "ooo";
        System.out.println(N);
        boolean b = true;
        for (int REPS = 10000; REPS --> 0; ) {
            b &= 
              needle
              //.substring(needle.length() - 3) // try with this
              .matches(pattern);
        }
        System.out.println(b);
    }
}

The string length in this test grows exponentially. If you run this test, you will find that it starts to really slow down after 10 (i.e. string length 1024). If you uncomment the substring line, however, the entire test will complete in no time (which also confirms that the problem is not because I didn't use Pattern.compile, which would yield a constant improvement at best, but rather because the patttern takes O(N) to match, which is problematic when the asymptotic growth of N is exponential).


Conclusion

It seems that Java regex does little to no optimization based on the pattern. Suffix matching in particular is especially costly, because the regex still needs to go through the entire length of the string.

Thankfully, doing the regex on the chopped suffix using substring (if you know the maximum length of the match) can still allow you to use regex for suffix matching in time independent of the length of the input string.

//update: actually I just realized that this applies to prefix matching too. Java regex matches a O(1) length prefix pattern in O(N). That is, "(?s)^[aeiou]{3}.*$" checks if a string starts with 3 lowercase letters in O(N) when it should be optimizable to O(1).

I thought prefix matching would be more regex-friendly, but I don't think it's possible to come up with a O(1)-runtime pattern to match the above (unless someone can prove me wrong).

Obviously you can do the s.substring(0, 3).matches("(?s)^[aeiou]{3}.*$") "trick", but the pattern itself is still O(N); you've just manually reduced N to a constant by using substring.

So for any kind of finite-length prefix/suffix matching of a really long string, you should preprocess using substring before using regex; otherwise it's O(N) where O(1) suffices.


In my tests, I found the following:

Using java's String.split method (which uses regex) took 2176ms under 1,000,000 iterations. Using this custom split method took 43ms under 1,000,000 iterations.

Of course, it will only work if your "regex" is completely literal, but in those cases, it will be much faster.

List<String> array = new ArrayList<String>();
String split = "ab";
String string = "aaabaaabaa";
int sp = 0;
for(int i = 0; i < string.length() - split.length(); i++){              
    if(string.substring(i, i + split.length()).equals(split)){
        //Split point found
        array.add(string.substring(sp, i));
        sp = i + split.length();
        i += split.length();
    }
}
if(sp != 0){
    array.add(string.substring(sp, string.length()));
}
return array;

So to answer your question, is it theoretically faster? Yes, absolutely, my algorithm is O(n), where n is the length of the string to split. (I'm not sure what regex would be). Is it practically faster? Well, over 1 million iterations, I saved basically 2 seconds. So, it depends on your needs I guess, but I wouldn't worry too much about backporting all code that uses regex to non-regex versions, and in fact, that might be necessary anyways, if the pattern is very complex, a literal split like this won't work. However, if you're splitting on, say, commas, this method will perform much better, though "much better" is subjective here.


Well, not always but sometimes slow, depends on patterns and implementations.

A quick example, 2x slower than normal replace, but I don't think its that slow.

>>> import time,re
>>>
>>> x="abbbcdexfbeczexczczkef111anncdehbzzdezf" * 500000
>>>
>>> start=time.time()
>>> y=x.replace("bc","TEST")
>>> print time.time()-start,"s"
0.350999832153 s
>>>
>>> start=time.time()
>>> y=re.sub("bc","TEST",x)
>>> print time.time()-start,"s"
0.751000165939 s
>>>
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号