I need a data structure, implemented in Ruby
language, to house a huge amount of distinct urls (e.g. 10**10
pieces), so the performance is my concern.
I am using an Array
and keep its elements ordered, so I can perform binary search
to find if one already exists or where to insert a url rapidly.
I wrote this:
class UrlStore
def initialize(*args)
@urls = []
args.each { |e| add(e) unless e.class != String }
end
def add(url)
p = find(url)
@urls.insert(p, url) unless p.class == String
end
def find(url)
l, m, h = 0, 0, @urls.size - 1
while l <= h do
m = l + (h - l) / 2
b = url <=> @urls[m]
if b == 0
return m.to_s
elsif b == 1
l = m + 1
else
h = m - 1
end
end
return l
end
end
The find
method will, if found, return the position of the url in hosting array, but in String
form in order to distinguish from those positions found to be inserted; Otherwise, return an integer(Fixnum
) telling wher开发者_运维技巧e the url should go to keep the array ordered.
But note that, I use Array#insert
to add an element at a specified position. My intuition tells me that this method will move all elements after insert-point a step backward, which may cause severe performance deduction. The Array
module is not in the standard library, its Ruby's intrinsic data type, so I don't know if I am right.
May be it's so naive a data structure for hosting such a task. So can any one share an awesome one.
If, as Phrogz has suggested you do manage to get 370GB of memory, or you realise you don't actually need to store that many URLs, you might want to look into using a SortedSet.
You might want to look at the growing number of Open source NoSQL solutions including MongoDB, Cassandra, Kyoto Cabinet or Redis.
MongoHQ provides a free hosting service for MongoDB. RedisToGo provides a free hosting service for Redis. Both have very easy to use Ruby bindings. I have used both and recommend them.
I have heard good things about Cassandra and Kyoto Cabinet but have not used them in any production app.
精彩评论