I have 100 servers in my cluster.

At time 17:35:00, all 100 servers are provided with data (of size 1 [MB]). Each server processes the data and produces an output of about 40 [MB]. The processing time for each server is 5 [sec].

At time 17:35:05 (5 [sec] later), a central machine needs to read all the output from all 100 servers (remember, the total size of the data is 100 [machines] x 40 [MB] ~ 4 [GB]), aggregate it, and produce an output.

It is of high importance that the entire process of gathering the 4 [GB] of data from all 100 servers takes as little time as possible. How do I go about solving this problem?

Are there any existing tools (ideally in Python, but I would consider other solutions) that can help?
Look at the flow of data in your application, then look at the data rate your (I assume shared) disk system provides, the rate your GigE interconnect provides, and the topology of your cluster. Which of these is the bottleneck?

GigE provides a theoretical maximum transmission rate of 125 MB/s between nodes, so moving 100 x 40 MB chunks (~4 GB) from the 100 processing nodes into your central node over GigE will take at least ~30 s.

A file system shared between all your nodes provides an alternative to RAM-to-RAM transfer over Ethernet.

If your shared file system is fast at the disk read/write level (say, a bunch of many-disk RAID 0 or RAID 10 arrays aggregated into a Lustre file system or some such) and it uses a 20 Gb/s or 40 Gb/s interconnect between block storage and nodes, then having 100 nodes each write a 40 MB file to disk and the central node read those 100 files may be faster than transferring the 100 x 40 MB chunks over the GigE node-to-node interconnect.

But if your shared file system is a RAID 5 or 6 array exported to the nodes via NFS over GigE, that will be slower than RAM-to-RAM transfer via GigE using RPC or MPI, because you have to write and read the disks over GigE anyway.
So, there have been some good answers and discussion of your question. But we do (did) not know your node interconnect speed, and we do not know how your disk is set up (shared disk, or one disk per node), whether a shared disk has its own interconnect, or what speed that is.
Node interconnect speed is now known. It is no longer a free variable.

Disk setup (shared/not shared) is unknown, thus a free variable.

Disk interconnect (assuming a shared disk) is unknown, thus another free variable.

How much RAM your central node has is unknown (can it hold 4 GB of data in RAM?), thus a free variable.
If everything, including the shared disk, uses the same GigE interconnect, then it is safe to say that having 100 nodes each write a 40 MB file to disk and then having the central node read 100 x 40 MB files is the slowest way to go, unless your central node cannot allocate 4 GB of RAM without swapping, in which case things probably get complicated.
If your shared disk is high performance it may be the case that it is faster for 100 nodes to each write a 40MB file, and for the central node to read 100 40MB files.
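To make the comparison concrete, here is a back-of-envelope sketch; the bandwidth figures are assumptions you would replace with measured numbers from your own cluster and storage:

```python
# Back-of-envelope transfer-time estimates.
# All rates below are assumptions; replace them with measured numbers
# from your own cluster and storage.

TOTAL_MB = 100 * 40              # 100 nodes x 40 MB each ~= 4 GB

GIGE_MB_PER_S = 125              # theoretical GigE maximum
FAST_FS_MB_PER_S = 1000          # assumed aggregate rate of a striped / Lustre-like FS
NFS_RAID_MB_PER_S = 80           # assumed rate of RAID 5/6 exported via NFS over GigE

def seconds(total_mb, rate_mb_per_s):
    return total_mb / rate_mb_per_s

print("RAM-to-RAM over GigE:           ~%.0f s" % seconds(TOTAL_MB, GIGE_MB_PER_S))
# The shared-disk path pays for a write and a read of the same data.
print("Fast shared FS (write + read):  ~%.0f s" % (2 * seconds(TOTAL_MB, FAST_FS_MB_PER_S)))
print("NFS on RAID 5/6 (write + read): ~%.0f s" % (2 * seconds(TOTAL_MB, NFS_RAID_MB_PER_S)))
```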
Use rpyc
. It's mature and actively maintained.
Here's their blurb about what it does:
RPyC (IPA:/ɑɹ paɪ siː/, pronounced like are-pie-see), or Remote Python Call, is a transparent and symmetrical python library for remote procedure calls, clustering and distributed-computing. RPyC makes use of object-proxying, a technique that employs python's dynamic nature, to overcome the physical boundaries between processes and computers, so that remote objects can be manipulated as if they were local.
David Mertz has a quick introduction to RPyC at IBM developerWorks.
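As a rough illustration of how that could look here (the host names, port, and the exposed_get_output method are hypothetical, not part of RPyC itself), each server could expose its result as a service and the central machine could pull all 100 of them:

```python
# worker.py -- runs on each of the 100 servers (a sketch; the service
# and its exposed_get_output() method are hypothetical)
import rpyc
from rpyc.utils.server import ThreadedServer

class OutputService(rpyc.Service):
    def exposed_get_output(self):
        # In reality, return the ~40 MB result this node just produced.
        with open("/tmp/output.bin", "rb") as f:
            return f.read()

if __name__ == "__main__":
    ThreadedServer(OutputService, port=18861).start()
```

```python
# central.py -- runs on the central machine
import rpyc

hosts = ["node%03d" % i for i in range(100)]   # hypothetical host names

chunks = []
for host in hosts:
    conn = rpyc.connect(host, 18861)
    chunks.append(conn.root.get_output())      # calls exposed_get_output() remotely
    conn.close()

aggregate = b"".join(chunks)                   # stand-in for the real aggregation
```

Pulling the nodes one after another serializes the transfers; fetching over several connections in parallel (e.g. with a thread pool) would keep the central node's GigE link busy.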
What's your networking setup? If your central machine is connected to the cluster by a single gigabit link, it's going to take at least ~30 s to copy the 4 GB to it (and that's assuming 100% efficiency, about 8 s per gigabyte, which I've never seen).
Experiment! Other answers have included tips on what to experiment with, but you might solve the problem in the most straightforward way and use that as your baseline.
You have 1 MB producing 40 MB of output on each server; experiment with having each server compress the data before it is sent. (That compression might be nearly free if it is part of your file system.)
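For instance, a minimal sketch with the standard library (the file paths are hypothetical placeholders):

```python
# compress_output.py -- run on each server after processing (a sketch;
# the file paths are hypothetical placeholders)
import zlib

with open("/tmp/output.bin", "rb") as f:
    raw = f.read()                   # this server's ~40 MB output

compressed = zlib.compress(raw, 1)   # level 1: fast, still usually shrinks well

with open("/tmp/output.bin.zz", "wb") as f:
    f.write(compressed)

print("%.1f MB -> %.1f MB" % (len(raw) / 1e6, len(compressed) / 1e6))
```

How much this helps depends entirely on how compressible the 40 MB outputs are; measure before committing to it.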
Latency - it is never zero.
Can you change your algorithms?
Can you do some sort of hierarchical merging of outputs, rather than one CPU doing all 4 GB at once? (Decimation in time.)
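A two-level sketch of that idea (purely illustrative; fetch(), merge(), and the group size of 10 are assumptions): ten intermediate nodes each combine the outputs of ten workers, so the central node only has to combine ten partial results:

```python
# hierarchical_merge.py -- illustrative scheduling of a two-level merge.
# fetch() and merge() are hypothetical stand-ins for your transport and
# aggregation code; in practice each group's merge would run on its own
# intermediate node, in parallel.

def fetch(node):
    """Pull one node's ~40 MB output (placeholder for an RPC or shared-FS read)."""
    raise NotImplementedError

def merge(chunks):
    """Combine a list of outputs into one partial result (placeholder)."""
    raise NotImplementedError

workers = ["node%03d" % i for i in range(100)]
groups = [workers[i:i + 10] for i in range(0, 100, 10)]

# Level 1: each of 10 intermediate nodes merges its group's 10 outputs.
partials = [merge([fetch(n) for n in group]) for group in groups]

# Level 2: the central node merges only 10 partial results instead of 100 outputs.
final = merge(partials)
```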
It is possible to buy quad-socket servers with 80 cores; would that be quicker, since storage could be local and you could configure the one machine with a lot of RAM?
Can you write your code using a Python binding to MPI? MPI has facilities for over-the-wire data transmission from M nodes to N nodes, M, N >= 1.
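A minimal mpi4py sketch of the M-to-1 case (assuming rank 0 is the central machine, ranks 1 to 100 are the processing servers, and the output file path is a placeholder):

```python
# gather_outputs.py -- run with something like: mpiexec -n 101 python gather_outputs.py
# A sketch: rank 0 is the central machine, ranks 1..100 are the servers;
# the output file path is a placeholder.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    payload = None                       # the central node contributes no data
else:
    with open("/tmp/output.bin", "rb") as f:
        payload = f.read()               # this node's ~40 MB result

# gather() collects every rank's payload into a list on the root rank.
chunks = comm.gather(payload, root=0)

if rank == 0:
    outputs = [c for c in chunks if c is not None]
    aggregate = b"".join(outputs)        # stand-in for the real aggregation
    print("gathered %d outputs, %.1f MB total"
          % (len(outputs), len(aggregate) / 1e6))
```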
Also, as mentioned above you could write the data to 100 files on a shared filesystem, then read the files on the 'master' node.