开发者

Find out which words in a large list occur in a small string

开发者 https://www.devze.com 2023-02-08 00:47 出处:网络
I have a static \'large\' list of words, about 300-500 words, called \'list1\' given a relatively short string str of about 40 words, what is the fastest method in ruby to get:

I have a static 'large' list of words, about 300-500 words, called 'list1'

given a relatively short string str of about 40 words, what is the fastest method in ruby to get:

  1. the number of times a word in list1 occurs i开发者_高级运维n str (counting multiple occurrences)
  2. a list of which words in list1 occur one or more times in the string str
  3. the number of words in (2)

'Occuring' in str means either as a whole word in str, or as a partial within a word in str. So if 'fred' is in list1 and str contained 'fred' and 'freddie' that would be two matches.

Everything is lowercase, so any matching does not have to care about case.

For example:

list1 ="fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"

so str contains sam, jack, fred (twice)

for part (1) the expression would return 4 (sam+jack+fred+fred)

for part (2) the expression would return "sam jack fred"

and part (3) is 3

The 'ruby way' to do this eludes me after 4 hours... with iteration it's easy enough (but slow). Any help would be appreciated!


Here's my shot at it:

def match_freq(exprs, strings)
  rs, ss, f = exprs.split.map{|x|Regexp.new(x)}, strings.split, {}
  rs.each{|r| ss.each{|s| f[r] = f[r] ? f[r]+1 : 1 if s=~r}}
  [f.values.inject(0){|a,x|a+x}, f, f.size]
end

list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"
x = match_freq(list1, str)
x # => [4, {/sam/=>1, /fred/=>2, /jack/=>1}, 3]

The output of "match_freq" is an array of your output items (a,b,c). The algorithm itself is O(n*m) where n is the number of items in list1 and m is the size of the input string, I don't think you can do better than that (in terms of big-oh). But there are smaller optimizations that might pay off like keeping a separate counter for the total number of matches instead of computing it afterwards. This was just my quick hack at it.

You can extract just the matching words from the output as follows:

matches = x[1].keys.map{|x|x.source}.join(" ") # => "sam fred jack"

Note that the order won't be preserved necessarily, if that's important you'll have to keep a separate list of the order they were found.


Here's an alternative implementation, for your edification:

def match_freq( words, str )
  words  = words.split(/\s+/)
  counts = Hash[ words.map{ |w| [w,str.scan(w).length] } ]
  counts.delete_if{ |word,ct| ct==0 }
  occurring_words = counts.keys
  [
    counts.values.inject(0){ |sum,ct| sum+ct }, # Sum of counts
    occurring_words,
    occurring_words.length
  ]
end

list1 = "fred sam sandy jack sue bill"
str   = "and so sammy went with jack to see fred and freddie"
x     = match_freq(list1, str)
p x   #=> [4, ["fred", "sam", "jack"], 3]

Note that if I needed this data I would probably just return the 'counts' hash from the method and then do whatever analysis I wanted on it. If I was going to return multiple 'values' from an analysis method, I might return a Hash of named values. Although, returning an array allows you to unsplat the results:

hits, words, word_count = match_freq(list1, str)
p hits, words, word_count  
#=> 4
#=> ["fred", "sam", "jack"]
#=> 3


For faster regular expressions, use https://github.com/mudge/re2. It is a ruby wrapper for Google re2 https://code.google.com/p/re2/

0

精彩评论

暂无评论...
验证码 换一张
取 消