Nested Groups in Regex

Developer https://www.devze.com 2022-12-27 10:58 Source: the web

I'm constructing a regex that looks for dates. I would like to return the date found and the sentence it was found in. In the code below, the strings on either side of date_string should check for the conditions of a sentence. For your sake, I've omitted the regex for date_string; suffice it to say, it works for picking out dates. While the inside of date_string isn't important, it is grouped as one entire regex.

"((?:[^.|?|!]*)"+date_string+"(?:[^.|?|!]*[.|?|!]\s*))"

The problem is that date_string is only matching the last number of any given date, presumably because the regex in front of date_string is matching too far and overrunning the date regex. For example, if I say "Independence Day is July 4.", I will get the sentence and 4, even though it should match 'July 4'. In case you're wondering, the alternatives inside date_string are ordered in such a way that 'July 4' should match first. Is there any way to do this all in one regex? Or do I need to split it up somehow (i.e. split up all text into sentences, and then check each sentence)?


There are several things wrong with your regex.

  1. There is no alternation in character classes. You want [^.?!], not [^.|?|!].
  2. You don't need the non-capturing groups at all.
  3. You probably don't need any "outer" grouping, since the entire match is what you look for.
  4. The part of your pattern preceding the date is greedy where it should be lazy, so it consumes part of your date.
  5. You make assumptions about what resembles a sentence that do not match reality. Your own example proves that, if you try.

Putting that last point aside for the moment, you end up with this version:

[^.?!]*?(July 4)[^.?!]*[.?!]\s*

Here the literal July 4 stands in for your date regex. Run against the text of your question, it produces these matches:

  1. ' For example, if I say "Independence Day is July 4.'
  2. '", I will get the sentence and 4, even though it should match 'July 4'. '

which pretty much proves my point #5.
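To reproduce those two matches, here is a minimal sketch in Python (the question does not name its language, so Python is an assumption). It runs the pattern over the relevant snippet of the question text; the variable names are illustrative only.

```python
import re

# Pattern from above, with the literal "July 4" standing in for the date regex.
pattern = re.compile(r"[^.?!]*?(July 4)[^.?!]*[.?!]\s*")

# The relevant part of the question text (one string, split for readability).
text = ('For example, if I say "Independence Day is July 4.", I will get '
        "the sentence and 4, even though it should match 'July 4'. ")

# Collect every full match; each one is what the pattern considers a "sentence".
matches = [m.group(0) for m in pattern.finditer(text)]
for m in matches:
    print(repr(m))
```

The period inside the quotation ends the first "sentence" early, and the second match begins at the closing quote, which is not where a reader would say a sentence starts.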


You can make the repetition operator non-greedy by adding a question mark. In your case it would be

[^.?!]*?
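A small Python sketch of the difference, assuming a hypothetical stand-in for date_string (the real one was omitted from the question). The stand-in tries 'July 4' style dates first and falls back to a bare number, which is enough to reproduce the symptom.

```python
import re

# Hypothetical stand-in for the omitted date regex: "July <n>" first, bare number second.
DATE = r"(?:July \d+|\d+)"
text = "Independence Day is July 4."

# Greedy prefix: consumes as much as possible, then backtracks only until
# the date group can match anything at all, which is the trailing "4".
greedy = re.compile(r"[^.?!]*(" + DATE + r")[^.?!]*[.?!]\s*")

# Lazy prefix: consumes as little as possible, so the date group gets the
# first chance to match at the earliest position, which is "July 4".
lazy = re.compile(r"[^.?!]*?(" + DATE + r")[^.?!]*[.?!]\s*")

greedy_date = greedy.search(text).group(1)
lazy_date = lazy.search(text).group(1)
print(greedy_date)  # 4
print(lazy_date)    # July 4
```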

And yes, splitting the text into sentences first (preferably excluding the terminating punctuation character) would make this much easier.
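One way the split-first approach might look in Python. This is a sketch under assumptions: DATE is a hypothetical stand-in for date_string, the sample text is made up, and the splitter is deliberately naive (it breaks on every '.', '?' or '!', so abbreviations and quoted sentences will still trip it up).

```python
import re

# Hypothetical stand-in for the omitted date regex.
DATE = re.compile(r"(?:July|December) \d{1,2}")

text = "Independence Day is July 4. Nothing here! Is Christmas on December 25?"

# Naive split: break after each terminator and drop it, along with any
# following whitespace. The trailing empty string is filtered out.
sentences = [s for s in re.split(r"[.?!]\s*", text) if s]

# Search each sentence on its own, so no prefix pattern can overrun the date.
found = []
for sentence in sentences:
    m = DATE.search(sentence)
    if m:
        found.append((m.group(0), sentence))

print(found)
```

Because the date regex is applied to each sentence by itself, the greedy-versus-lazy question never comes up.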

(Seems like I didn't look at what was in the character class. Replaced it with tloflin's.)
