Regex vs. string:find() for simple word boundary_问答_开发者

Regex vs. string:find() for simple word boundary

开发者 https://www.devze.com 2023-02-04 19:36 出处：网络

Say I only need to find out whether a line read from a file contains a word from a finite set of words.

Say I only need to find out whether a line read from a file contains a word from a finite set of words.

One way of doing this is to use a regex like this:

.*\y(good|better|best)\y.*

Another way of accomplishing this is using a pseudo code like this:

 if ( (readLine.find("good")   != string::npos) ||
      (readLine.find("better") != string::npos) ||
      (readLine.find("best")   != string::npos) )
 {
   // line contains a word from a finite set of words.
 }

Which way will have better performance? (i.e. 开发者_如何学Pythonspeed and CPU utilization)

The regexp will perform better, but get rid of those '.*' parts. They complicate the code and don't serve any purpose. A regexp like this:

\y(good|better|best)\y

will search through the string in a single pass. The algorithm it builds from this regexp will look first for \y, then character 1 (g|b), then character 2 (g => go or b => be), character 3 (go => goo or be => bes|bet), character 4 (go => good or bes => best or bet => bett), etc. Without building your own state machine, this is as fast as it gets.

You won't know which is faster until you've measured, but the issues at stake are:

The regex implementation, esp. whether it needs to precompile (like Google RE2, POSIX regexes).
The implementation of string::find.
The length of the string you're searching in.
How many strings you're searching in.

My bets are on the regex, but again: you've got to measure to be sure.

Obviously not the second one (using 'find'), since you're running three comparisons (need to traverse the string at least 3 times) instead of one hopefully smart one. If the regex engine works at all like it should (and I suppose it does) then it will probably be at least three times faster.