Finding if a sentence contains a specific phrase in Ruby

Right now I am checking whether a sentence contains a specific word by splitting the sentence into an array and then calling include? to see if it contains the word. Something like:

"This is my awesome sentence.".split(" ").include?('awesome')

But I'm wondering what the fastest way to do this with a phrase is. For example, I might want to see whether the sentence "This is my awesome sentence." contains the phrase "my awesome sentence". I am scraping sentences and comparing a very large number of phrases, so speed is somewhat important.

Here are some variations:

require 'benchmark'

lorem = ('Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut' # !> unused literal ignored
        'enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in' # !> unused literal ignored
        'reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,' # !> unused literal ignored
        'sunt in culpa qui officia deserunt mollit anim id est laborum.' * 10) << ' foo'


lorem.split.include?('foo') # => true
lorem['foo']                # => "foo"
lorem.include?('foo')       # => true
lorem[/foo/]                # => "foo"
lorem[/fo{2}/]              # => "foo"
lorem[/foo$/]               # => "foo"
lorem[/fo{2}$/]             # => "foo"
lorem[/fo{2}\Z/]            # => "foo"
/foo/.match(lorem)[-1]      # => "foo"
/foo$/.match(lorem)[-1]     # => "foo"
/foo/ =~ lorem              # => 621

n = 500_000

puts RUBY_VERSION
puts "n=#{ n }"
Benchmark.bm(25) do |x|
  x.report("array search:")             { n.times { lorem.split.include?('foo') } }
  x.report("literal search:")           { n.times { lorem['foo']                } }
  x.report("string include?:")          { n.times { lorem.include?('foo')       } }
  x.report("regex:")                    { n.times { lorem[/foo/]                } }
  x.report("wildcard regex:")           { n.times { lorem[/fo{2}/]              } }
  x.report("anchored regex:")           { n.times { lorem[/foo$/]               } }
  x.report("anchored wildcard regex:")  { n.times { lorem[/fo{2}$/]             } }
  x.report("anchored wildcard regex2:") { n.times { lorem[/fo{2}\Z/]            } }
  x.report("/regex/.match")             { n.times { /foo/.match(lorem)[-1]      } }
  x.report("/regex$/.match")            { n.times { /foo$/.match(lorem)[-1]     } }
  x.report("/regex/ =~")                { n.times { /foo/ =~ lorem              } }
  x.report("/regex$/ =~")               { n.times { /foo$/ =~ lorem             } }
  x.report("/regex\Z/ =~")              { n.times { /foo\Z/ =~ lorem            } }
end

And the results for Ruby 1.9.3:

1.9.3
n=500000
                                user     system      total        real
array search:              12.960000   0.010000  12.970000 ( 12.978311)
literal search:             0.800000   0.000000   0.800000 (  0.807110)
string include?:            0.760000   0.000000   0.760000 (  0.758918)
regex:                      0.660000   0.000000   0.660000 (  0.657608)
wildcard regex:             0.660000   0.000000   0.660000 (  0.660296)
anchored regex:             0.660000   0.000000   0.660000 (  0.664025)
anchored wildcard regex:    0.660000   0.000000   0.660000 (  0.664897)
anchored wildcard regex2:   0.320000   0.000000   0.320000 (  0.328876)
/regex/.match               1.430000   0.000000   1.430000 (  1.424602)
/regex$/.match              1.430000   0.000000   1.430000 (  1.434538)
/regex/ =~                  0.530000   0.000000   0.530000 (  0.538128)
/regex$/ =~                 0.540000   0.000000   0.540000 (  0.536318)
/regexZ/ =~                 0.210000   0.000000   0.210000 (  0.214547)

And 1.8.7:

1.8.7
n=500000
                               user     system      total        real
array search:             21.250000   0.000000  21.250000 ( 21.296039)
literal search:            0.660000   0.000000   0.660000 (  0.660102)
string include?:           0.610000   0.000000   0.610000 (  0.612433)
regex:                     0.950000   0.000000   0.950000 (  0.946308)
wildcard regex:            2.840000   0.000000   2.840000 (  2.850198)
anchored regex:            0.950000   0.000000   0.950000 (  0.951270)
anchored wildcard regex:   2.870000   0.010000   2.880000 (  2.874209)
anchored wildcard regex2:  2.870000   0.000000   2.870000 (  2.868291)
/regex/.match              1.470000   0.000000   1.470000 (  1.479383)
/regex$/.match             1.480000   0.000000   1.480000 (  1.498106)
/regex/ =~                 0.680000   0.000000   0.680000 (  0.677444)
/regex$/ =~                0.700000   0.000000   0.700000 (  0.704486)
/regexZ/ =~                0.700000   0.000000   0.700000 (  0.701943)

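So, from the 1.9.3 results, using a fixed string search like 'foobar'['foo'] is slower than using a regex 'foobar'[/foo/], which is slower than the equivalent 'foobar' =~ /foo/. In 1.8.7 the order is different: the fixed string search and =~ are close, and both beat the regex subscript.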

The OP's original solution suffers badly because it traverses the string twice: once to split it into individual words, and a second time to iterate over the array looking for the actual target word. Its performance will degrade further as the string size increases.
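Beyond speed, the split approach cannot match a multi-word phrase at all, because no single array element contains a space. A minimal sketch of the difference, using the sentence from the question (the variable names are only for illustration):

```ruby
sentence = 'This is my awesome sentence.'
phrase   = 'my awesome sentence'

# Splitting yields individual words, so a multi-word phrase never equals an element.
words     = sentence.split(' ')
split_hit = words.include?(phrase)

# A plain substring search scans the string once and handles phrases.
substring_hit = sentence.include?(phrase)

# split also keeps punctuation attached ("sentence."), so even the
# single word 'sentence' is not found by the array approach.
word_hit = words.include?('sentence')

p split_hit      # false
p substring_hit  # true
p word_hit       # false
```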

One thing I find interesting about the performance of Ruby is that a regex anchored with $ is slightly slower than an unanchored regex. In Perl, the opposite was true when I first ran this sort of benchmark several years ago.
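For reference, the end anchors used in these benchmarks are not interchangeable: $ matches at the end of any line, \Z matches at the end of the string or just before a final newline, and \z matches only at the very end. A small sketch of the difference, with sample strings made up for illustration:

```ruby
multi_line = "foo\nbar"    # "foo" ends a line, but not the string
trailing   = "bar foo\n"   # "foo" is followed only by a final newline

dollar_mid = multi_line =~ /foo$/   # $ matches before the newline in the middle
big_z_mid  = multi_line =~ /foo\Z/  # "foo" is not at the end of the string
big_z_end  = trailing   =~ /foo\Z/  # \Z tolerates one trailing newline
small_z    = trailing   =~ /foo\z/  # \z requires the absolute end

p dollar_mid  # 0
p big_z_mid   # nil
p big_z_end   # 4
p small_z     # nil
```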

Here's an updated version using Fruity. The various expressions return different results. Any could be used if you want to see whether the target string exists. If you want to see whether the value is at the end of the string, like these are testing, or to get the location of the target, then some are definitely faster than others so pick accordingly.
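Because the return types differ (substring, index, or boolean), it can help to normalize them when only a yes/no answer is needed. A sketch, where str is a made-up sample string; String#match? assumes Ruby 2.4 or later:

```ruby
str = 'a sentence ending in foo'

by_slice   = !str['foo'].nil?        # String#[] returns the substring or nil
by_index   = !str.index('foo').nil?  # index returns an Integer or nil
by_eq      = !(/foo/ =~ str).nil?    # =~ returns the match offset or nil
by_include = str.include?('foo')     # already a boolean
by_match_p = str.match?(/foo/)       # boolean; does not set $~ (Ruby 2.4+)

p [by_slice, by_index, by_eq, by_include, by_match_p]
```

All five agree on whether the target is present; they differ only in what else they hand back.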

require 'fruity'

TARGET_STR = (' ' * 100) + ' foo'

TARGET_STR['foo']            # => "foo"
TARGET_STR[/foo/]            # => "foo"
TARGET_STR[/fo{2}/]          # => "foo"
TARGET_STR[/foo$/]           # => "foo"
TARGET_STR[/fo{2}$/]         # => "foo"
TARGET_STR[/fo{2}\Z/]        # => "foo"
TARGET_STR[/fo{2}\z/]        # => "foo"
TARGET_STR[/foo\Z/]          # => "foo"
TARGET_STR[/foo\z/]          # => "foo"
/foo/.match(TARGET_STR)[-1]  # => "foo"
/foo$/.match(TARGET_STR)[-1] # => "foo"
/foo/ =~ TARGET_STR          # => 101
/foo$/ =~ TARGET_STR         # => 101
/foo\Z/ =~ TARGET_STR        # => 101
TARGET_STR.include?('foo')   # => true
TARGET_STR.index('foo')      # => 101
TARGET_STR.rindex('foo')     # => 101


puts RUBY_VERSION
puts "TARGET_STR.length = #{ TARGET_STR.length }"

puts
puts 'compare fixed string vs. unanchored regex'
compare do 
  fixed_str        { TARGET_STR['foo'] }
  unanchored_regex { TARGET_STR[/foo/] }
end

puts
puts 'compare /foo/ to /fo{2}/'
compare do
  unanchored_regex  { TARGET_STR[/foo/]   }
  unanchored_regex2 { TARGET_STR[/fo{2}/] }
end

puts
puts 'compare unanchored vs. anchored regex'
compare do 
  unanchored_regex      { TARGET_STR[/foo/]    }
  anchored_regex_dollar { TARGET_STR[/foo$/]   }
  anchored_regex_Z      { TARGET_STR[/foo\Z/] }
  anchored_regex_z      { TARGET_STR[/foo\z/] }
end

puts
puts 'compare /foo/, match and =~'
compare do
  unanchored_regex    { TARGET_STR[/foo/]           }
  unanchored_match    { /foo/.match(TARGET_STR)[-1] }
  unanchored_eq_match { /foo/ =~ TARGET_STR         }
end

puts
puts 'compare fixed, unanchored, Z, include?, index and rindex'
compare do
  fixed_str        { TARGET_STR['foo']          }
  unanchored_regex { TARGET_STR[/foo/]          }
  anchored_regex_Z { TARGET_STR[/foo\Z/]        }
  include_eh       { TARGET_STR.include?('foo') }
  _index           { TARGET_STR.index('foo')    }
  _rindex          { TARGET_STR.rindex('foo')   }
end

Which results in:

# >> 2.2.3
# >> TARGET_STR.length = 104
# >> 
# >> compare fixed string vs. unanchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 2x ± 0.1
# >> 
# >> compare /foo/ to /fo{2}/
# >> Running each test 8192 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >> 
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 19.999999999999996% ± 10.0%
# >> unanchored_regex is similar to anchored_regex_dollar
# >> 
# >> compare /foo/, match and =~
# >> Running each test 8192 times. Test will take about 1 second.
# >> unanchored_eq_match is faster than unanchored_regex by 2x ± 0.1 (results differ: 101 vs foo)
# >> unanchored_regex is faster than unanchored_match by 3x ± 0.1
# >> 
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 3 seconds.
# >> _rindex is similar to include_eh (results differ: 101 vs true)
# >> include_eh is faster than _index by 10.000000000000009% ± 10.0% (results differ: true vs 101)
# >> _index is faster than fixed_str by 19.999999999999996% ± 10.0% (results differ: 101 vs foo)
# >> fixed_str is faster than anchored_regex_Z by 39.99999999999999% ± 10.0%
# >> anchored_regex_Z is similar to unanchored_regex

Changing the size of the string shows how the relative performance shifts as the search has more text to scan.

Changing to 1,000 characters:

# >> 2.2.3
# >> TARGET_STR.length = 1004
# >> 
# >> compare fixed string vs. unanchored regex
# >> Running each test 4096 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 50.0% ± 10.0%
# >> 
# >> compare /foo/ to /fo{2}/
# >> Running each test 2048 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >> 
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> anchored_regex_z is faster than anchored_regex_Z by 10.000000000000009% ± 10.0%
# >> anchored_regex_Z is faster than unanchored_regex by 3x ± 0.1
# >> unanchored_regex is similar to anchored_regex_dollar
# >> 
# >> compare /foo/, match and =~
# >> Running each test 4096 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 1001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 2x ± 0.1
# >> 
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 4 seconds.
# >> _rindex is faster than anchored_regex_Z by 2x ± 1.0 (results differ: 1001 vs foo)
# >> anchored_regex_Z is faster than include_eh by 2x ± 0.1 (results differ: foo vs true)
# >> include_eh is faster than fixed_str by 10.000000000000009% ± 10.0% (results differ: true vs foo)
# >> fixed_str is similar to _index (results differ: foo vs 1001)
# >> _index is similar to unanchored_regex (results differ: 1001 vs foo)

Bumping it to 10,000:

# >> 2.2.3
# >> TARGET_STR.length = 10004
# >> 
# >> compare fixed string vs. unanchored regex
# >> Running each test 512 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 39.99999999999999% ± 10.0%
# >> 
# >> compare /foo/ to /fo{2}/
# >> Running each test 256 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >> 
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 3 seconds.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 21x ± 1.0
# >> unanchored_regex is similar to anchored_regex_dollar
# >> 
# >> compare /foo/, match and =~
# >> Running each test 256 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 10001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 10.000000000000009% ± 10.0%
# >> 
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 32768 times. Test will take about 18 seconds.
# >> _rindex is faster than anchored_regex_Z by 2x ± 0.1 (results differ: 10001 vs foo)
# >> anchored_regex_Z is faster than include_eh by 15x ± 1.0 (results differ: foo vs true)
# >> include_eh is similar to _index (results differ: true vs 10001)
# >> _index is similar to fixed_str (results differ: 10001 vs foo)
# >> fixed_str is faster than unanchored_regex by 39.99999999999999% ± 10.0%

Ruby v2.6.5 results:

# >> 2.6.5
# >> n=500000
# >>                                 user     system      total        real
# >> array search:               6.744581   0.012204   6.756785 (  6.766078)
# >> literal search:             0.351014   0.000334   0.351348 (  0.351866)
# >> string include?:            0.325576   0.000493   0.326069 (  0.326331)
# >> regex:                      0.373231   0.000512   0.373743 (  0.374197)
# >> wildcard regex:             0.371914   0.000356   0.372270 (  0.372549)
# >> anchored regex:             0.373606   0.000568   0.374174 (  0.374736)
# >> anchored wildcard regex:    0.374923   0.000349   0.375272 (  0.375729)
# >> anchored wildcard regex2:   0.136772   0.000384   0.137156 (  0.137474)
# >> /regex/.match               0.662532   0.003377   0.665909 (  0.666605)
# >> /regex$/.match              0.671762   0.005036   0.676798 (  0.677691)
# >> /regex/ =~                  0.322114   0.000404   0.322518 (  0.322917)
# >> /regex$/ =~                 0.332067   0.000995   0.333062 (  0.334226)
# >> /regexZ/ =~                 0.078958   0.000069   0.079027 (  0.079082)

and:

# >> 2.6.5
# >> TARGET_STR.length = 104
# >> 
# >> compare fixed string vs. unanchored regex
# >> Running each test 32768 times. Test will take about 1 second.
# >> fixed_str is faster than unanchored_regex by 2x ± 0.1
# >> 
# >> compare /foo/ to /fo{2}/
# >> Running each test 8192 times. Test will take about 1 second.
# >> unanchored_regex is similar to unanchored_regex2
# >> 
# >> compare unanchored vs. anchored regex
# >> Running each test 16384 times. Test will take about 1 second.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is similar to anchored_regex_dollar
# >> anchored_regex_dollar is similar to unanchored_regex
# >> 
# >> compare /foo/, match and =~
# >> Running each test 16384 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 101 vs foo)
# >> unanchored_regex is faster than unanchored_match by 3x ± 1.0 (results differ: foo vs )
# >> 
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 65536 times. Test will take about 3 seconds.
# >> _rindex is similar to include_eh (results differ: 101 vs true)
# >> include_eh is similar to _index (results differ: true vs 101)
# >> _index is similar to fixed_str (results differ: 101 vs foo)
# >> fixed_str is faster than anchored_regex_Z by 2x ± 0.1
# >> anchored_regex_Z is faster than unanchored_regex by 19.999999999999996% ± 10.0%

# >> 2.6.5
# >> TARGET_STR.length = 1004
# >> 
# >> compare fixed string vs. unanchored regex
# >> Running each test 32768 times. Test will take about 2 seconds.
# >> fixed_str is faster than unanchored_regex by 7x ± 1.0
# >> 
# >> compare /foo/ to /fo{2}/
# >> Running each test 2048 times. Test will take about 1 second.
# >> unanchored_regex is similar to unanchored_regex2
# >> 
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 1 second.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 3x ± 1.0
# >> unanchored_regex is similar to anchored_regex_dollar
# >> 
# >> compare /foo/, match and =~
# >> Running each test 2048 times. Test will take about 1 second.
# >> unanchored_eq_match is faster than unanchored_regex by 10.000000000000009% ± 10.0% (results differ: 1001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 39.99999999999999% ± 10.0% (results differ: foo vs )
# >> 
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 65536 times. Test will take about 4 seconds.
# >> _rindex is similar to include_eh (results differ: 1001 vs true)
# >> include_eh is similar to _index (results differ: true vs 1001)
# >> _index is similar to fixed_str (results differ: 1001 vs foo)
# >> fixed_str is faster than anchored_regex_Z by 2x ± 1.0
# >> anchored_regex_Z is faster than unanchored_regex by 4x ± 1.0


# >> 2.6.5
# >> TARGET_STR.length = 10004
# >> 
# >> compare fixed string vs. unanchored regex
# >> Running each test 8192 times. Test will take about 2 seconds.
# >> fixed_str is faster than unanchored_regex by 31x ± 10.0
# >> 
# >> compare /foo/ to /fo{2}/
# >> Running each test 512 times. Test will take about 1 second.
# >> unanchored_regex2 is similar to unanchored_regex
# >> 
# >> compare unanchored vs. anchored regex
# >> Running each test 8192 times. Test will take about 3 seconds.
# >> anchored_regex_z is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 27x ± 1.0
# >> unanchored_regex is similar to anchored_regex_dollar
# >> 
# >> compare /foo/, match and =~
# >> Running each test 512 times. Test will take about 1 second.
# >> unanchored_eq_match is similar to unanchored_regex (results differ: 10001 vs foo)
# >> unanchored_regex is faster than unanchored_match by 10.000000000000009% ± 10.0% (results differ: foo vs )
# >> 
# >> compare fixed, unanchored, Z, include?, index and rindex
# >> Running each test 65536 times. Test will take about 14 seconds.
# >> _rindex is faster than _index by 2x ± 1.0
# >> _index is similar to include_eh (results differ: 10001 vs true)
# >> include_eh is similar to fixed_str (results differ: true vs foo)
# >> fixed_str is similar to anchored_regex_Z
# >> anchored_regex_Z is faster than unanchored_regex by 26x ± 1.0

"Best way to find a substring in a string" is related.

You can easily check if a string contains another string with square brackets like so:

irb(main):084:0> "This is my awesome sentence."["my awesome sentence"]
=> "my awesome sentence"
irb(main):085:0> "This is my awesome sentence."["cookies for breakfast?"]
=> nil

It returns the substring if it finds it, or nil if it doesn't. It should be very fast.
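Since the question involves checking many phrases, the same idea extends to a list. A minimal sketch, where sentence and phrases are illustrative values. Regexp.union escapes each string and builds one alternation pattern; whether that beats a simple loop depends on the data, so it is worth benchmarking before relying on it:

```ruby
sentence = 'This is my awesome sentence.'
phrases  = ['my awesome sentence', 'cookies for breakfast?', 'This is']

# Check each phrase individually with a plain substring search.
found = phrases.select { |phrase| sentence.include?(phrase) }

# Or test for any of the phrases at once with a single combined regex.
any_phrase = Regexp.union(phrases)
first_hit  = sentence[any_phrase]   # leftmost match, or nil if none match

p found      # ["my awesome sentence", "This is"]
p first_hit  # "This is"
```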

Here's a non-answer showing the benchmark for the code by @TheTinMan for Ruby 1.9.2 on OS X. Note the difference in relative performance, specifically the improvements in the 2nd and 3rd tests.

                               user     system      total        real
array search:              7.960000   0.000000   7.960000 (  7.962338)
literal search:            0.450000   0.010000   0.460000 (  0.445905)
string include?:           0.400000   0.000000   0.400000 (  0.400932)
regex:                     0.510000   0.000000   0.510000 (  0.512635)
wildcard regex:            0.520000   0.000000   0.520000 (  0.514800)
anchored regex:            0.510000   0.000000   0.510000 (  0.513328)
anchored wildcard regex:   0.520000   0.000000   0.520000 (  0.517759)
/regex/.match              0.940000   0.000000   0.940000 (  0.943471)
/regex$/.match             0.940000   0.000000   0.940000 (  0.936782)
/regex/ =~                 0.440000   0.000000   0.440000 (  0.446921)
/regex$/ =~                0.450000   0.000000   0.450000 (  0.447904)

I ran these benchmarks with Benchmark.bmbm, but the results do not differ between the rehearsal round and the actual timings, which are shown above.
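For anyone who has not used it, Benchmark.bmbm runs every report twice: a rehearsal pass, then the measured pass. A cut-down sketch of the idea; the sentence, phrase, labels, and iteration count here are assumptions chosen so it runs quickly, not the setup used for the numbers above:

```ruby
require 'benchmark'

sentence = ('lorem ipsum ' * 50) << 'my awesome sentence'
phrase   = 'my awesome sentence'
n        = 10_000  # kept small so the sketch finishes quickly

# bmbm performs a rehearsal run first, then the real run, to reduce the
# effect of warm-up and memory allocation on the reported timings.
results = Benchmark.bmbm(14) do |x|
  x.report('include?:')   { n.times { sentence.include?(phrase) } }
  x.report('literal []:') { n.times { sentence[phrase] } }
  x.report('regex =~:')   { n.times { /my awesome sentence/ =~ sentence } }
end
```

The return value is an array with one Benchmark::Tms entry per report, in case the timings need to be processed rather than just printed.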

If you're not familiar with regular expressions, I believe they can solve your problem here:

http://www.regular-expressions.info/ruby.html

Basically you'll create a regular expression object looking for "awesome" (most likely case insensitive) and then you can call:

/regex/.match(string)

This returns match data. If you want the index at which the match starts, you can do this:

match = "This is my awesome sentence." =~ /awesome/
puts match   # prints 11, the index of the first letter of the match (the "a" in "awesome")
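To make that concrete, here is a short sketch of what match and =~ give back, including the case-insensitive i flag mentioned above. Regexp.escape is used on the assumption that the phrase comes from data and may contain regex metacharacters; the sample sentence is made up:

```ruby
sentence = 'This is my AWESOME sentence.'
phrase   = 'awesome'

# Case-insensitive pattern with any metacharacters in the phrase escaped.
pattern = /#{Regexp.escape(phrase)}/i

index = sentence =~ pattern      # offset of the match, or nil
m     = pattern.match(sentence)  # MatchData object, or nil

puts index             # 11
puts m[0]              # AWESOME
puts m.pre_match.inspect  # "This is my "
```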

I'd read the article for more details, though, as it explains it better than I would. If you'd rather skip the background and jump straight into using regular expressions, I'd recommend this:

http://www.ruby-doc.org/core/classes/Regexp.html