C# regex: negative lookahead fails with the single line option

Developer (开发者) https://www.devze.com 2023-01-01 12:56 Source: the web
I am trying to figure out why a regex with negative look ahead fails when the "single line" option is turned on.

Example (simplified):

<source>Test 1</source>
<source>Test 2</source>
<target>Result 2</target>
<source>Test 3</source>

This:

<source>(?!.*<source>)(.*?)</source>(?!\s*<target)

will fail if the single line option is on, and will work if the single line option is off. For instance, this works (disables the single line option):

(?-s:<source>(?!.*<source>)(.*?)</source>(?!\s*<target))

My understanding is that the single line mode simply allows the dot "." to match new lines, and I don't see why it would affect the expression above.

Can anyone explain what I am missing here?

::::::::::::::::::::::

EDIT: (?!.*) is a negative look ahead not a capturing group.

 <source>(?!.*?<source>)(.*?)</source>(?!\s*<target)

will ALSO FAIL if the single line mode is on, so it doesn't look like this is a greediness issue. Try it in a Regex designer (like Expresso or Rad regex):

With single line OFF, it matches (as expected):

<source>Test 1</source>    
<source>Test 3</source>

With single line ON:

<source>Test 3</source>

I don't understand why it doesn't match the first one as well: the text between its tags does not contain <source>, which is what the first negative look ahead forbids, so it should match the expression.
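The behaviour described above can be reproduced outside a regex designer. The following is a minimal sketch in Python rather than C#, on the assumption that Python's re.DOTALL flag plays the same role as the .NET single line option (the dot also matches newlines). The names TEXT, GREEDY, LAZY and captured are made up for this illustration.

```python
import re

# Sample input from the question.
TEXT = (
    "<source>Test 1</source>\n"
    "<source>Test 2</source>\n"
    "<target>Result 2</target>\n"
    "<source>Test 3</source>\n"
)

# The two patterns from the question: greedy and lazy dot inside the lookahead.
GREEDY = r"<source>(?!.*<source>)(.*?)</source>(?!\s*<target)"
LAZY = r"<source>(?!.*?<source>)(.*?)</source>(?!\s*<target)"


def captured(pattern, flags=0):
    """Return the text captured by group 1 for every match in TEXT."""
    return re.findall(pattern, TEXT, flags)


for name, pattern in (("greedy", GREEDY), ("lazy", LAZY)):
    print(name, "single line off:", captured(pattern))
    print(name, "single line on: ", captured(pattern, re.DOTALL))
```

Both variants give the same result in each mode, which fits the observation that greediness is not the cause.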


I believe this is what you're looking for:

<source>((?:(?!</?source>).)*)</source>(?!\s*<target)

The idea is that you match one character at a time, but only after making sure it isn't the start of <source> or </source>. Because the lookahead stops the repetition at the next tag, and with the /? that covers both the opening and the closing tag, you don't have to use a non-greedy quantifier.
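As a rough check of this pattern, here is a small Python sketch, again assuming re.DOTALL stands in for the .NET single line option. TEMPERED and unpaired_sources are hypothetical names chosen for the example.

```python
import re

# Sample input from the question.
TEXT = (
    "<source>Test 1</source>\n"
    "<source>Test 2</source>\n"
    "<target>Result 2</target>\n"
    "<source>Test 3</source>\n"
)

# Each character is consumed only if no <source> or </source> tag starts there.
TEMPERED = r"<source>((?:(?!</?source>).)*)</source>(?!\s*<target)"


def unpaired_sources(text, flags=re.DOTALL):
    """Return the contents of <source> elements not followed by a <target>."""
    return re.findall(TEMPERED, text, flags)


print(unpaired_sources(TEXT))     # dot matches newlines
print(unpaired_sources(TEXT, 0))  # dot does not match newlines
```

Since the per-character lookahead keeps the capture inside a single element, the result does not depend on whether the dot matches newlines, and content that spans several lines is still captured when it does.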


It "fails" because the negative lookahead appears to be misplaced.

<source>(?!.*<source>)(.*?)</source>(?!\s*<target)
        ^^^^^^^^^^^^^^

Now, let's consider what (?!.*<source>) does here: it's a lookahead that says that there is NO match for .*<source> from that position.

In single-line mode, . matches every character, including newlines. After each of the first two <source> tags, .*<source> can therefore run across the line breaks and find a later <source>. So the negative lookahead fails for the first two <source>.

On the last <source>, .*<source> no longer matches because no further <source> follows, so the negative lookahead succeeds. The rest of the pattern also succeeds, and that's why you only get <source>Test 3</source> in single-line mode.
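To see this position by position, the sketch below tries the body of the lookahead, .*<source>, right after each opening tag. It is written in Python on the assumption that re.DOTALL behaves like the .NET single-line option here. lookahead_passes is a hypothetical helper, not part of any library.

```python
import re

# Sample input from the question.
TEXT = (
    "<source>Test 1</source>\n"
    "<source>Test 2</source>\n"
    "<target>Result 2</target>\n"
    "<source>Test 3</source>\n"
)


def lookahead_passes(text, flags):
    """For each <source> tag, report whether (?!.*<source>) would pass there."""
    probe = re.compile(r".*<source>", flags)  # body of the negative lookahead
    results = []
    for tag in re.finditer(r"<source>", text):
        # The negative lookahead passes only if the probe finds NO match
        # starting at the position just after the opening tag.
        results.append(probe.match(text, tag.end()) is None)
    return results


print("single-line off:", lookahead_passes(TEXT, 0))
print("single-line on: ", lookahead_passes(TEXT, re.DOTALL))
```

With the dot confined to one line the lookahead passes at all three tags. Once the dot crosses newlines it passes only at the last one.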
