开发者

Making Regular Expression more efficient

开发者 https://www.devze.com 2023-01-20 07:26 出处:网络
I\'m attempting to determine the end of an English sentence (only approximately), by looking for \"!\", \"?\" or \".\", but in the case of \".\" only when not preceeded by common abbreviations such as

I'm attempting to determine the end of an English sentence (only approximately), by looking for "!", "?" or ".", but in the case of "." only when not preceeded by common abbreviations such as Mr. or Dr.

Is there any way to make the following Regular Expression even marginally more efficient? Perhaps by sorting the negative lookbehinds by descending size, or even by alphabetical order?

Here is the regular expression I have now:

((?<!St|Sgt|Rev|Ltd|Inc|Lt|Jr|Sr|Esq|Inst|Hon|Gen|Cpl|Comdr|Col|Corp|Mr|Dr|Gov|Mrs|Ms|[A-Z]|Assn|Capt)(\.)|(!)|(\?))(\s*$|\s+([_$#]开发者_运维技巧|[A-Z][^.]))

The Problem:

The site at http://regex.powertoy.org/ reports: "7 matches 21044 probes (finished)" on even a simple paragraph... That outrageous size of the number 21044 seems intimately tied to the number of negative lookbehinds.

I'm seeking to reduce the computational complexity for the RegEx engine, as I have several GB of data to pass through it.

Is there any way to tigten this up? Is negative lookbehind truly the best/only way to accomplish this? Is there a way to do it as a lookahead instead? Is regex the wrong tool for this task?

EDIT: I am able to use either ActionScript's or PHP's RegEx engine.

EDIT: I cannot count on the number of spaces between sentences. Really!? sigh.

Please don't answer if you have no understanding of the inner workings of the RegEx engine, as pertaining to optimization.

Thanks in advance.


Perhaps try only doing the negative-lookbehind test after successfully matching . rather than at every character:

(?x:  # Allow spacing and comments
    (   
        (\.)    # First match "."
        (?<!    # Then negative-look-behind for titles followed by "."
            (?: St|Sgt|Rev|Ltd|Inc|Lt|Jr|Sr|Esq|Inst|Hon|Gen|Cpl|Comdr|Col|Corp|Mr|Dr|Gov|Mrs|Ms|[A-Z]|Assn|Capt)
            \.
        )
      |  (!)  
      |  (\?)
    )
    ( \s* $  |  \s+ ( [_$#] | [A-Z] [^.] ))
)

This took the number of probes from 70000 down to 2500 or so on powertoy.org using that site's initial help text. (But powertoy didn't like my multi-line regex or the "x" flag or something so I had to squash the regex onto one line to test it).

You can go even further by using common prefixes in your list of titles:

(?x:  # Allow spacing and comments
    (
        (\.)    # First match "."
        (?<!    # Then negative-look-behind for titles followed by "."
            (?:Assn|C(?:apt|ol|omdr|orp|pl)|Dr|Esq|G(?:en|ov)|Hon|I(?:nc|nst)|Jr|L(?:t|td)|M(?:[rs]|rs)|Rev|S(?:gt|[rt])|[A-Z])
            \.
        )
      |  (!)  
      |  (\?)
    )
    ( \s* $  |  \s+ ( [_$#] | [A-Z] [^.] ))
)

This took the probe count down to about 2000.

EDIT:
Another trick that reduces the probe-count is to include a look-ahead for a capital letter at the start of the look-behind section (but I couldn't say for sure that it makes the regex more efficient) (Also including @Swiss's suggestion for word-boundary):

        (?<!   # Then negative-look-behind for titles followed by "."
               \b (?= [A-Z] )  # But first ensure we have a capital letter before going on
               (?:Assn|C(?:apt|ol|omdr|orp|pl)|Dr|Esq|G(?:en|ov)|Hon|I(?:nc|nst)|Jr|L(?:t|td)|M(?:[rs]|rs)|Rev|S(?:gt|[rt])|[A-Z])
            \.
        )


I would match the period first. Instead of this:

(?<!St|Sgt|Rev|Ltd|Inc|...|Capt)\.

...do this:

\.(?<!(?:St|Sgt|Rev|Ltd|Inc|...|Capt)\.)

The way you have it, the regex engine has to perform the lookbehind before it even determines that the next character is a period. When I make that change, it goes from 28,423 probes to 1,945. (I'm using the default text provided by the Powertoy site, since you didn't supply any.)

I would also combine the next two alternatives - (!)|(\?) - into one, i.e., ([!?]). That brings the probe count down to 1,344. And if you don't have to capture the individual parts of the match, I suggest you use non-capturing parentheses--i.e., (?:...) instead of (...). Though it isn't reflected in the probe count, it does make the regex more efficient.

EDIT: Looking closer at you regex, I see the reason it wasn't finding any matches is the [A-Z] alternative. Since the whole match is being done in case-insensitive mode, that alternative Trumps all the others. You should either remove the /i modifier or override locally with the (?-i:...) construct. And adding the \b (word boundary), as @Swiss suggested, is a good idea, too. That leaves you with:

(?-i:\.(?<!\b(?:St|Sgt|Rev|Ltd|Inc|...|[A-Z]|Assn|Capt)\.)

...which, along with the [!?] change, yields 6 matches in 1404 probes on the Regex Powertoy site.


Stop the presses!

According to your edited question, you're using either PHP or ActionScript for this project, which means none of the solutions offered so far will do any good. Your original regex isn't just too slow, it won't work at all on the target platforms. The reason it works on the Regex Powertoy site is because that site uses the Java regex flavor, which has looser restrictions on lookbehinds.

Lookbehind is kind of a black sheep among regex features; almost all flavors impose restrictions on the kind of expression you can use in one. In Perl or Python, for example, the expression has to have a fixed length: (?<!St) will work, as will (?<!Sgt|Rev), but (?<!St|Sgt) won't. Java is much more permissive; it accepts variable-length lookbehind expressions as long as the maximum possible length can be determined ahead of time, so (?<!St|Sgt) will work, as will (?<!\w{3,12}), but (?<!\w+) won't.

The PHP and ActionScript regex flavors are both powered by the PCRE library, which is just a tiny bit looser than Perl and Python with respect to lookbehinds. If the lookbehind expression is an alternation, the length of each alternative still has to be fixed, but they don't all have to have the same length. But this is only allowed at the top "level" of the regex--that is, the lookbehind can neither contain another grouping construct, nor be contained in one.

That's the part that makes all the solutions offered so far, unworkable. To get around the grouping restriction, I had to combine the checks for ., ! and ? into [.!?], and distribute the \b and \. among the alternatives, resulting in

/([.!?])(?<!\bSt\.|\bSgt\.|\bRev\.|\bLtd\.|\bInc\.|\bLt\.|\bJr\.|\bSr\.|\bEsq\.|\bInst\.|\bHon\.|\bGen\.|\bCpl\.|\bComdr\.|\bCol\.|\bCorp\.|\bMr\.|\bDr\.|\bGov\.|\bMrs\.|\bMs\.|\b[A-Z]\.|\bAssn\.|\bCapt\.)(?:\s*$|\s+(?:[_$#]|[A-Z][^.]))/

This brings the probe count up to 2208, which is still an order of magnitude better than what you started with. Whether that will be fast enough for your several GB of text, I have no idea.

EDIT: In your comment you proposed grouping the titles by length, which didn't work. But, taking it a step further, I put each of those groups in its own lookbehind, and (to my surprise) that does seem to work. Here's how it looks in free-spacing mode:

/(?:
   \.
     (?<!\bComdr\.)
     (?<!(?=\b[A-Z])(?:Assn|C(?:apt|orp)|Inst)\.)
     (?<!(?=\b[A-Z])(?:C(?:ol|pl)|Esq|G(?:en|ov)|Hon|Inc|Ltd|Mrs|Rev|Sgt)\.)
     (?<!(?=\b[A-Z])(?:Dr|Jr|Lt|M[rs]|S[rt])\.)
     (?<!\b[A-Z]\.)
   |
   [!?]
 )
 (?:\s*$|\s+(?:[_$#]|[A-Z][^.]))
/x

To see it in action, check out the PHP demo on ideone.com.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号