How can I find items contained within an array of hashes?_问答_开发者

Here's what I've got so far:

pages = []
pages << { :uri => 'a', :page => 'd' }
pages << { :uri => 'b', :pa开发者_运维技巧ge => 'e' }
pages << { :uri => 'c', :page => 'f' }

pages.each do |page|
  puts page.has_value? 'b'
end

But what I want is just a true or false answer, as in, do you contain 'e' or 'f'?

pages.any?{|e| e.has_value? 'b'} #=> returns true or false

The sort of structure you have can result in slow lookups once it gets large. If your data set is going to be big, and you don't have a back-end database to store things in along with indexed fields to search on, then you can reduce the in-memory search by pre-building a list of the values and searching it rather than iterating over the array and peeking into the hashes. Do this once, when the array-of-hashes has been created and is stable.

vals = pages.inject([]){ |m,h| m += h.values; m } #=> ["a", "d", "b", "e", "c", "f"]

Assigning that to a variable will speed up the task of seeing if you have the value in question if you have to repeatedly do lookups:

vals.include?('a') #=> true
vals.include?('z') #=> false

If you are going to have a lot of duplicates in your hash values, it might be worth using a set instead of an array.

require 'set'
pages.inject([].to_set){ |m,h| m += h.values; m } #=> #<Set: {"a", "d", "b", "e", "c", "f"}>

The advantage to a Set is it only holds one copy of any particular element; Duplicates are ignored, keeping your search list as small as possible. The disadvantage is a Set has a bit more overhead when it is being created, slowing its creation time. Sets are based on Hashes though, so they are faster than sequential searches.

To show how slow iterating over the array can be, here's a benchmark searching through 52 hashes, looking for either 'A' or 'z', the first or last element respectively. The second two build the list of values then test for inclusion. The last two do the same only use Sets instead of Arrays.

require 'benchmark'
require 'set'

puts "RUBY_VERSION=#{ RUBY_VERSION }" 

# build a 52-element array.
pages = ( ('A'..'Z').to_a + ('a'..'z').to_a ).each_slice(2).inject([]) { |m,a| m << { uri:a[0], page:a[1] }; m }

n = 500_000
Benchmark.bm(20) do |x|
  x.report("AoH traversal A")   { n.times { pages.any?{ |h| h.has_value?('A') } } }
  x.report("AoH traversal z")   { n.times { pages.any?{ |h| h.has_value?('z') } } }
  x.report("array inclusion A") { vals = pages.inject([]){ |m,h| m += h.values; m }; n.times { vals.include?('A') }  }
  x.report("array inclusion z") { vals = pages.inject([]){ |m,h| m += h.values; m }; n.times { vals.include?('z') }  }
  x.report("set inclusion A")   { vals = pages.inject([].to_set){ |m,h| m += h.values; m }; n.times { vals.include?('A') }  }
  x.report("set inclusion z")   { vals = pages.inject([].to_set){ |m,h| m += h.values; m }; n.times { vals.include?('z') }  }
end

# >> RUBY_VERSION=1.9.2
# >>                           user     system      total        real
# >> AoH traversal A       1.140000   0.000000   1.140000 (  1.140952)
# >> AoH traversal z      19.130000   0.010000  19.140000 ( 19.135050)
# >> array inclusion A     0.450000   0.000000   0.450000 (  0.443042)
# >> array inclusion z     5.600000   0.010000   5.610000 (  5.605876)
# >> set inclusion A       0.490000   0.000000   0.490000 (  0.492484)
# >> set inclusion z       0.480000   0.000000   0.480000 (  0.479374)

EDIT:

every time I do an insert, I need to do a check first. This is part of a web spider. I will consider a db in the future.

Look at the results of the benchmark.

Your selected answer, and assumedly implemented solution, is, on average, 20x slower than using a Set for your lookup. You could maintain a Set, AND your structure and still be way ahead because the Set's lookup speed will degrade a lot more slowly. Even maintaining an Array separately is about 10x faster.

For example, check the Set for a hit or miss. If it's a hit move to the next page. If it's a miss push the info onto your Array of Hashes, then add the necessary hit information to the Set for the next loop. Do NOT totally rebuild the Set each time, only append the new information.

In a quick and dirty spider, I'd use a Hash, where the keys were the URLs I had scanned. I'd normalize them by stripping all query, session and other data from them, leaving only the base URL for the page. That way I could track if I'd already seen the page and skip it, otherwise you can end up hitting the same page multiple times as queries change. The Hash keys pointed to structures containing all the information I'd gleaned from scanning the page, so, at the end of processing I could dump the results for each page by walking the keys of the Hash.

The same tactic applies when using a database, otherwise you can fill a table with redundant page scans, instead of actually seeing unique pages.