开发者

java regex not finding longest match [duplicate]

开发者 https://www.devze.com 2023-02-05 07:36 出处:网络
This question already has an answer here: Reference - What does this regex mean? 开发者_JAVA百科 (1 answer)
This question already has an answer here: Reference - What does this regex mean? 开发者_JAVA百科 (1 answer) Closed 4 years ago.

How come java (using Matcher.find()) does not find the longest possible match?

regex = "ab*(bc)?"

With input of "abbbc" the regex finds "abbb", instead of "abbbc" which also matches and is longer. Is there a way to force it to match the longest possible string?


The (bc) is an exact string, it wasn't found because the b* was greedy, but since (bc)?
is optional the match suceeded after the last 'b'. You probably want something like this: ab*[bc]? but this doesent make sense so probably ab*c?. If this regex represents something more elaborate, you should post those examples.

Here is how the regex engine sees it:

Compiling REx "ab*(bc)?"
Matching REx "ab*(bc)?" against "abbbc"
   0 <> <abbbc>              |  1:EXACT <a>(3)
   1 <a> <bbbc>              |  3:STAR(6)
                                  EXACT <b> can match 3 times out of 2147483647...
   4 <abbb> <c>              |  6:  CURLYM[1] {0,1}(16)
   4 <abbb> <c>              | 10:    EXACT <bc>(14)
                                      failed...
                                    CURLYM trying tail with matches=0...
   4 <abbb> <c>              | 16:    END(0)
Match successful!

Compiling REx "ab*[bc]?"
Matching REx "ab*[bc]?" against "abbbc"
   0 <> <abbbc>              |  1:EXACT <a>(3)
   1 <a> <bbbc>              |  3:STAR(6)
                                  EXACT <b> can match 3 times out of 2147483647...
   4 <abbb> <c>              |  6:  CURLY {0,1}(19)
                                    ANYOF[bc] can match 1 times out of 1...
   5 <abbbc> <>              | 19:    END(0)
Match successful!


The portions match greedily left to right. So the b* matches greedily which causes (bc)? to fail which is fine, so the matcher never backtracks to try a shorter b*.

Maybe ab*?(?:(?![bc])|(bc)) does what you want.


Others have helped to improve the regexp; but just to emphasize the answer is "because it does greedy matching". That is, the match you get is the one it reaches according to algorithm (which basically does longest possible submatches, from left-to-right).


If your expression actually looks so, and you don't care about grouping, it can be rewritten as ab+c?.

If expression is actually more complex and having (bc) is essential, you can use negative lookahead as follows, I think it would be more elegant than Mike Samuel's solution: ab*(?!c)(bc)?.

0

精彩评论

暂无评论...
验证码 换一张
取 消