I have a network of 100 machines, all running Ubuntu Linux.
On a continuous (streaming) basis, machine X
is 'fed' with some real-time data. I need to write a Python script that will take the data as input, load it in memory, process it, and then save it to disk.
It's a lot of data, so I would ideally like to split the data in memory (using some logic) and send the pieces to the individual computers as fast as possible. Each individual computer will accept its piece of data, handle it, and write it to its local disk.
Suppose I have a container of data in Python (a list, a dictionary, etc.), already processed and split into pieces. What is the fastest way to send each piece to its target machine?
You should take a look at pyzmq:
http://www.zeromq.org/bindings:python
and the general guides to ZeroMQ (0MQ):
http://nichol.as/zeromq-an-introduction
http://www.zeromq.org/
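For a sense of how this could look, here is a minimal sketch using pyzmq's PUSH/PULL pattern, where a PUSH socket on X load-balances messages round-robin across all connected PULL workers. The host, port, and output path are placeholders, not anything from the question:

```python
import zmq

def run_distributor(pieces, port=5557):
    """On machine X: fan out pieces of data to connected workers.

    A PUSH socket round-robins messages across every connected
    PULL socket, so each worker receives a distinct piece.
    """
    context = zmq.Context()
    sender = context.socket(zmq.PUSH)
    sender.bind("tcp://*:%d" % port)
    for piece in pieces:
        sender.send_pyobj(piece)  # pickles arbitrary Python containers

def run_worker(x_host="10.0.0.1", port=5557):
    """On each of the 100 machines: receive, process, write to disk."""
    context = zmq.Context()
    receiver = context.socket(zmq.PULL)
    receiver.connect("tcp://%s:%d" % (x_host, port))
    while True:
        piece = receiver.recv_pyobj()
        # ... process the piece, then persist it locally ...
        with open("/var/data/piece.out", "ab") as f:  # placeholder path
            f.write(repr(piece).encode())
```

`send_pyobj`/`recv_pyobj` pickle whatever container you already have (list, dict, ...), which keeps the distributor code trivial at the cost of Python-only peers.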
You have two (classes of) choices:
- You could build some distribution mechanism yourself.
- You could use an existing tool to handle the distribution and storage.
In the simplest case, you write a program on each machine in your network that simply listens, processes, and writes, and you distribute from X to each machine in your pool round-robin. But you might also want to address higher-level concerns: handling node failures, dealing with requests that take longer to process than others, adding new nodes to the system, and so on.
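A bare-bones sketch of that do-it-yourself round-robin approach, using only the standard library (worker host names, the port, and the output path are made up, and error handling is omitted):

```python
import itertools
import pickle
import socket
import struct

WORKERS = [("node%02d" % i, 6000) for i in range(100)]  # hypothetical hosts

def send_piece(addr, piece):
    """Send one length-prefixed, pickled piece to a worker."""
    payload = pickle.dumps(piece)
    with socket.create_connection(addr) as conn:
        conn.sendall(struct.pack("!I", len(payload)) + payload)

def distribute(pieces):
    """On machine X: cycle through the pool, one piece per worker."""
    for addr, piece in zip(itertools.cycle(WORKERS), pieces):
        send_piece(addr, piece)

def worker(port=6000):
    """On each node: listen, unpickle, process, write to local disk."""
    srv = socket.socket()
    srv.bind(("", port))
    srv.listen()
    while True:
        conn, _ = srv.accept()
        with conn:
            size = struct.unpack("!I", conn.recv(4))[0]
            data = b""
            while len(data) < size:
                data += conn.recv(size - len(data))
            piece = pickle.loads(data)
            # ... process the piece, then write it to local disk ...
```

This shows how little code the happy path needs; everything listed above (failures, slow nodes, membership changes) is what the existing tools below buy you.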
As you want more functionality, you'll probably want to find an existing tool to help you. It sounds like you might want to investigate some combination of AMQP (for reliable messaging), Hadoop (for distributed data processing), or more complete NoSQL solutions like Cassandra or Riak. By leveraging these tools, your system will be significantly more robust than anything you could likely build yourself.
What you want is a message queue like RabbitMQ. It is easy to add consumers and producers to a queue, and a consumer can either poll the queue or be notified through a callback.
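To illustrate the callback style, a short sketch with pika (a Python client for RabbitMQ); the broker host and queue name are assumptions:

```python
import pickle
import pika

def producer(pieces, host="rabbit-host", queue="pieces"):
    """On machine X: publish each piece to a shared work queue."""
    conn = pika.BlockingConnection(pika.ConnectionParameters(host))
    channel = conn.channel()
    channel.queue_declare(queue=queue, durable=True)
    for piece in pieces:
        channel.basic_publish(exchange="", routing_key=queue,
                              body=pickle.dumps(piece))
    conn.close()

def consumer(host="rabbit-host", queue="pieces"):
    """On each worker: receive pieces via callback, process, then ack."""
    conn = pika.BlockingConnection(pika.ConnectionParameters(host))
    channel = conn.channel()
    channel.queue_declare(queue=queue, durable=True)

    def handle(ch, method, properties, body):
        piece = pickle.loads(body)
        # ... process the piece and write it to local disk ...
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=queue, on_message_callback=handle)
    channel.start_consuming()
```

Because the broker only removes a message on acknowledgment, a piece sent to a worker that dies mid-processing is redelivered to another consumer, which is the reliability the hand-rolled version lacks.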