
Working with a large data object between Ruby processes

I have a Ruby hash that reaches approximately 10 megabytes if written to a file using Marshal.dump. After gzip compression it is approximately 500 kilobytes.

Iterating through and altering this hash is very fast in Ruby (fractions of a millisecond). Even copying it is extremely fast.

The problem is that I need to share the data in this hash between Ruby on Rails processes. To do this using the Rails cache (file_store or memcached) I need to Marshal.dump the hash first, which incurs a 1000 millisecond delay when serializing it and a 400 millisecond delay when deserializing it.

Ideally I would want to be able to save and load this hash from each process in under 100 milliseconds.

One idea is to spawn a new Ruby process to hold this hash that provides an API to the other processes to modify or process the data within it, but I want to avoid doing this unless I'm certain that there are no other ways to share this object quickly.

Is there a way I can more directly share this hash between processes without needing to serialize or deserialize it?

Here is the code I'm using to generate a hash similar to the one I'm working with:

@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    if rand(10) == 0 
      @a[r][c] = 1 # 10% chance of being 1
    else
      @a[r][c] = 0
    end
  end
end

@c = Marshal.dump(@a) # 1000 milliseconds
Marshal.load(@c) # 400 milliseconds
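
For reference, the timings and compressed size can be measured like this (a minimal sketch using the standard Benchmark and Zlib libraries; the exact numbers will vary by machine, and Zlib::Deflate gives a zlib stream rather than a gzip file, so the size is close to but not identical to the gzip figure):

require 'benchmark'
require 'zlib'

dump = nil
puts Benchmark.measure { dump = Marshal.dump(@a) }  # ~1000 ms here
puts Benchmark.measure { Marshal.load(dump) }       # ~400 ms here
puts "raw: #{dump.bytesize} bytes"
puts "compressed: #{Zlib::Deflate.deflate(dump).bytesize} bytes"  # ~500 KB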

Update:

Since my original question did not receive many responses, I'm assuming there's no solution as easy as I had hoped.

Presently I'm considering two options:

  1. Create a Sinatra application to store this hash with an API to modify/access it.
  2. Create a C application to do the same as #1, but a lot faster.

The scope of my problem has increased such that the hash may be larger than my original example, so #2 may be necessary. But I have no idea where to start in terms of writing a C application that exposes an appropriate API.

A good walkthrough of how best to implement #1 or #2 may receive best answer credit.

Update 2

I ended up implementing this as a separate application written in Ruby 1.9 that has a DRb interface to communicate with the application instances. I use the Daemons gem to spawn the DRb instance when the web server starts up. On startup the DRb application loads the necessary data from the database, and then it communicates with the clients to return results and to stay up to date. It's running quite well in production now. Thanks for the help!
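
For anyone taking the same route, here is a minimal sketch of that setup; the Daemons gem API and DRb are real, but the script name, port, and load_from_database helper are placeholders:

# interest_daemon.rb -- run with `ruby interest_daemon.rb start`
require 'daemons'

Daemons.run_proc('interest_server') do
  require 'drb'

  data = {}                    # the shared hash lives only in this process
  data.extend DRbUndumped      # clients get a remote reference, not a marshalled copy
  # data.replace(load_from_database)  # hypothetical: populate from the DB on startup

  DRb.start_service 'druby://localhost:9999', data
  DRb.thread.join
end

# In each Rails process (e.g. an initializer):
#   require 'drb'
#   DRb.start_service
#   SHARED_DATA = DRbObject.new(nil, 'druby://localhost:9999')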


A Sinatra app will work, but the (un)serializing and the HTTP parsing could impact performance compared to a DRb service.

Here's an example, based on the one in your related question. I'm using a hash instead of an array so you can use user ids as keys; that way there is no need to keep both a table of interests and a table of user ids on the server. Note that the interest table is "transposed" compared to your example, which is the way you want it anyway, so it can be updated in one call.

# server.rb
require 'drb'

class InterestServer < Hash
  include DRbUndumped # don't send the data over!

  # Returns [match_count, user_id] pairs for every user, sorted ascending,
  # scored against cur_user_id's interests.
  def closest(cur_user_id)
    cur_interests = fetch(cur_user_id)
    selected_interests = cur_interests.each_index.select { |i| cur_interests[i] }

    scores = map do |user_id, interests|
      nb_match = selected_interests.count { |i| interests[i] }
      [nb_match, user_id]
    end
    scores.sort!
  end
end

DRb.start_service nil, InterestServer.new
puts DRb.uri

DRb.thread.join


# client.rb

uri = ARGV.shift
require 'drb'
DRb.start_service
interest_server = DRbObject.new nil, uri


USERS_COUNT = 10_000
INTERESTS_COUNT = 500

# Mock users
users = Array.new(USERS_COUNT) { {:id => rand(100000)+100000} }

# Initial send over user interests
users.each do |user|
  interest_server[user[:id]] = Array.new(INTERESTS_COUNT) { rand(10) == 0 }
end

# query at will
puts interest_server.closest(users.first[:id]).inspect

# update, say there's a new user:
new_user = {:id => 42}
users << new_user
# This guy is interested in everything!
interest_server[new_user[:id]] = Array.new(INTERESTS_COUNT) { true } 

puts interest_server.closest(users.first[:id])[-2,2].inspect
# Will output our first user and this new user which both match perfectly

To run in a terminal, start the server and give its output as the argument to the client:

$ ruby server.rb
druby://mal.lan:51630

$ ruby client.rb druby://mal.lan:51630
[[0, 100035], ...]

[[45, 42], [45, 178902]]


Maybe it's too obvious, but if you sacrifice a little access speed to the members of your hash, a traditional database will give you much more consistent access times to the values. You could start there and then add caching to see if you can get enough speed out of it. This will be a little simpler than using Sinatra or some other tool.
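
As a minimal sketch of that approach (assuming the sqlite3 gem; the table layout and pack format are just one way to do it):

require 'sqlite3'

db = SQLite3::Database.new('interests.db')
db.execute 'CREATE TABLE IF NOT EXISTS interests (user_id INTEGER PRIMARY KEY, bits BLOB)'

# Store one user's flags as a compact byte string instead of a marshalled Array
flags = Array.new(500) { rand(10) == 0 ? 1 : 0 }
db.execute 'INSERT OR REPLACE INTO interests VALUES (?, ?)',
           [42, SQLite3::Blob.new(flags.pack('C*'))]

# Any process can read a single row back without touching the rest of the data
row = db.get_first_row('SELECT bits FROM interests WHERE user_id = ?', [42])
flags_back = row[0].unpack('C*')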


Be careful with memcached: it has an object size limit (1 MB by default).

One thing to try is to use MongoDB as your storage. It is pretty fast and you can map pretty much any data structure into it.
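
A minimal sketch of the MongoDB idea (assuming the mongo gem and a local mongod; the database and collection names are illustrative):

require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'interest_demo')
interests = client[:interests]

# One document per user; the boolean array maps directly into BSON
interests.replace_one({ _id: 42 },
                      { _id: 42, flags: Array.new(500) { rand(10) == 0 } },
                      upsert: true)

doc = interests.find(_id: 42).first
puts doc['flags'].count(true)  # number of interests for user 42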


If it's sensible to wrap your monster hash in a method call, you might simply present it using DRb - start a small daemon that starts a DRb server with the hash as the front object - other processes can make queries of it using what amounts to RPC.

More to the point, is there another approach to your problem? Without knowing what you're trying to do, it's hard to say for sure - but maybe a trie or a Bloom filter would work? Or even a nicely interfaced bitfield would probably save you a fair amount of space; see the sketch below.
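
For the bitfield idea, a minimal sketch: since each value is 0 or 1, a whole row of flags fits in a single Ruby Integer, which is far more compact than an Array and cheap to copy between processes (the pack_row helper is illustrative):

# Pack a row of 0/1 flags into one arbitrary-precision Integer
def pack_row(flags)
  flags.each_with_index.inject(0) do |bits, (flag, i)|
    flag == 1 ? bits | (1 << i) : bits
  end
end

a = pack_row(Array.new(10_000) { rand(10) == 0 ? 1 : 0 })
b = pack_row(Array.new(10_000) { rand(10) == 0 ? 1 : 0 })

a[5]                                  # Integer#[] reads one bit (0 or 1)
shared = (a & b).to_s(2).count('1')   # bitwise AND counts interests in common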


Have you considered upping the memcached max object size?

For versions 1.4.2 and later:

memcached -I 11m # raise the item size limit to 11 MB, giving the 10 MB hash some headroom

On earlier versions, change the value of POWER_BLOCK in slabs.c and recompile.


What about storing the individual values in Memcache instead of storing the whole Hash in Memcache? Using your code above:

# Cache here is your memcache client (e.g. Rails.cache or the memcache-client gem)
0.upto(500) do |r|
  0.upto(10_000) do |c|
    key = "#{r}:#{c}"
    if rand(10) == 0
      Cache.set(key, 1) # 10% chance of being 1
    else
      Cache.set(key, 0)
    end
  end
end

This will be speedy, you won't have to worry about serialization, and all of your systems will have access to it. I asked in a comment on the main post about how you will access the data; you will have to get creative, but it should be easy to do.
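
For the read side, a minimal sketch (assuming the same Cache wrapper as above; get_multi exists in the common Ruby memcache clients and fetches a batch of keys in one round trip):

# One cell
Cache.get("3:1500")

# A whole row in a single round trip
keys = (0...10_000).map { |c| "3:#{c}" }
row = Cache.get_multi(*keys)  # => { "3:0" => 0, "3:1" => 1, ... }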
