Say I only need to find out whether a line read from a file contains a word from a finite set of words.
One way of doing this is to use a regex like this:
.*\y(good|better|best)\y.*
Another way of accomplishing this is using a pseudo code like this:
if ( (readLine.find("good") != string::npos) ||
(readLine.find("better") != string::npos) ||
(readLine.find("best") != string::npos) )
{
// line contains a word from a finite set of words.
}
Which way will have better performance? (i.e. 开发者_如何学Pythonspeed and CPU utilization)
The regexp will perform better, but get rid of those '.*' parts. They complicate the code and don't serve any purpose. A regexp like this:
\y(good|better|best)\y
will search through the string in a single pass. The algorithm it builds from this regexp will look first for \y, then character 1 (g|b), then character 2 (g => go or b => be), character 3 (go => goo or be => bes|bet), character 4 (go => good or bes => best or bet => bett), etc. Without building your own state machine, this is as fast as it gets.
You won't know which is faster until you've measured, but the issues at stake are:
- The regex implementation, esp. whether it needs to precompile (like Google RE2, POSIX regexes).
- The implementation of
string::find
. - The length of the string you're searching in.
- How many strings you're searching in.
My bets are on the regex, but again: you've got to measure to be sure.
Obviously not the second one (using 'find'), since you're running three comparisons (need to traverse the string at least 3 times) instead of one hopefully smart one. If the regex engine works at all like it should (and I suppose it does) then it will probably be at least three times faster.
精彩评论