开发者

Java regex dies on stack overflow: need a better version

开发者 https://www.devze.com 2022-12-15 14:14 出处:网络
I\'m working on a JMD (Java MarkDown) (a Java port of MarkDownSharp) but I\'m having an issue w开发者_如何学编程ith one regex in particular. For the file Markdown_Documentation_Syntax.text this regula

I'm working on a JMD (Java MarkDown) (a Java port of MarkDownSharp) but I'm having an issue w开发者_如何学编程ith one regex in particular. For the file Markdown_Documentation_Syntax.text this regular expression dies:

private static final String BLOCK_TAGS_1 = "p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del";
private static final String BLOCKS_NESTED_PATTERN = String.format("" +
        "(" +                      // save in $1
        "^" +                      // start of line (with MULTILINE)
        "<(%s)" +                  // start tag = $2
        "\\b" +                    // word break
        "(.*\\n)*?" +              // any number of lines, minimally matching
        "</\\2>" +                 // the matching end tag
        "[ \\t]*" +                // trailing spaces/tags
        "(?=\\n+|\\Z)" +           // followed by a newline or end of
        ")", BLOCK_TAGS_1);

which translates to:

(^<(p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del)\b(.*\n)*?</\2>[ \t]*(?=\n+|\Z))

This pattern is looking for accepted block tags that are anchored to the start of a line, followed by any number of lines and then are terminated by a matching tag followed by a newline or a string terminator. This generates:

java.lang.StackOverflowError
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
        ...

This can be dealt with by increasing the stack space for Java (defaults to 128k/400k for oss/ss IIRC) but the above expression is slow anyway.

So I'm looking for a regex guru who can do better (or at least explain the performance problem with this pattern). The C# version is a little slow but works fine. PHP seems to have no issues with this either.

Edit: This is on JDK6u17 running on Windows 7 64 Ultimate.


This part:

(.*\n)*?

will involve A LOT of unnecessary backtracking because of the nested * and since there are chars that have to match afterwards.

I just ran a quick benchmark in perl on some arbitrary strings and got a 13-15% improvement just by switching that piece to

(?>.*\n)*?

which does non-capturing, independent subgrouping. That gives you two benefits, it no longer wastes time capturing the matching string, and more importantly, it no longer backtracks on the innermost .* which is a waste of time anyway. There's no way that only a portion of that .* will ever result in a valid match so explicitly making it all or nothing should help.

Don't know if that's a sufficient improvement in this case, however.


While improving the pattern does help and is advisable, Java's pattern matcher is recursive and it is generally best to switch to an iterative solution.

When I had similar problems, I switched to jregex (http://jregex.sourceforge.net/) and that worked for me.

The pattern match may have succeeded now with the improved solution, but it may fail if a text 10 times as big was given.

PS: Sorry for necromancing an old topic but this thread is ranked highly on google and it would benefit people if I put it here


The sub-expression: "(.*\\n)*?" (and the improved accepted answer version: "(?>.*\n)*?"), both have a problem: They fail to match a block element written on one line. In other words, they fail to match this:

<div>one-liner</div>

If this is not the desired behavior, a correct (and much more efficient) solution is to simply use:

.*?

And turn on single line mode.

0

精彩评论

暂无评论...
验证码 换一张
取 消