In a distributed system, a certain node distributes 'X' units of work equally across 'N' nodes (via socket message passing).
As we increase the number of worker nodes, each nodes completes his job faster but we have to set-up more connections.
In a real situation, it would be similar to changing 10 nodes in a Hadoop-like system with each node processing 100GB by 1,000,000 nodes with each node processing 1MB.
- What's the impact of setting up more connections in this case? Is this a big overhead in p开发者_运维知识库oll() function?
- What's the best approach?
Sounds like you will need to consult Amdahl's Law.
At least it was how I computed how many machines on a high-speed switch were optimal for my parallel computations.
Does it have to use sockets and message passing between Supervisor and Worker?
You can use some type of queuing so avoid putting load onto the Supervisor. Or a distributed file system similar to HDFS to distribute the tasks and collect the results.
It also depends on the number of nodes you are planning to deploy the Workers on. 1,000,000 nodes is a very big number therefore in that case, you'll have to distribute the tasks into multiple queues.
The thing to be careful about is what will happen if all the nodes finish their tasks at the same time. It would be worth putting some variability into when they can request for a new task. ZooKeeper (http://hadoop.apache.org/zookeeper/) is potentially something you can also use to synchronise the jobs.
Can you measure your network cost? The time spent working on the worker machine should be only part of the cost of the message pass and receive.
Also can you describe the O notation for handling each worker result into the master result?
Does your master round robin expected responses?
btw -- if your worker nodes are finishing quicker but underutilizing the cpu resources you may be missing a design trade-off?
of course, you could be the rule or the exception to any law(argument/out of date research). ;-)
精彩评论