I have a process which writes huge data over the network. Let's say it runs on machine A and dumps a file of around 70-80 GB onto machine B over NFS. After process 1 finishes and exits, my process 2 runs on machine A and fetches this file from machine B over NFS. The bottleneck in the entire cycle is the writing and reading of this huge data file. How can I reduce this I/O time? Can I somehow keep the data loaded in memory, ready for use by process 2, even after process 1 has exited?
I'd appreciate ideas on this. Thanks.
Edit: since process 2 'reads' the data directly from the network, would it be better to copy the data locally first and then read it from the local disk? I mean, would (read time over network) > (time to cp to local disk) + (read time from local disk)?
If you want to keep the data loaded in memory, then you'll need 70-80 GB of RAM.
Perhaps the best option is to attach local storage (a hard disk drive) to machine A and keep this file locally.
The obvious answer is to reduce network writes, which could save you a great deal of time and improve reliability; there seems very little point in copying a file to another machine only to copy it back. To answer your question more precisely, we would need more information.
There is a lot of network and I/O overhead with this approach, so you may not be able to reduce the latency much further without changing it. Some alternatives:
- Since the file is around 80 GB, create a memory-mapped file (mmap) that process 1 writes into and process 2 later reads from - no network involved, only machine A is used - but some disk I/O overhead is still unavoidable.
- Faster: run both processes simultaneously and use a semaphore or another signalling mechanism so that process 1 can indicate to process 2 that the file (or a part of it) is ready to be read.
- Fastest approach: let process 1 create a shared memory region and share it with process 2. Whenever a limit is reached (the maximum data chunk that can be loaded into memory, based on your RAM size), let process 1 signal process 2 that the data can be read and processed. This solution is feasible only if the file/data can actually be processed chunk by chunk instead of as one big 80 GB chunk - a rough sketch follows this list.
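Below is a minimal shell sketch of that chunked hand-off, assuming both processes run on machine A. It uses a tmpfs path (/dev/shm, which is RAM-backed on most Linux systems) in place of a real shared-memory segment and simple polling in place of a semaphore; generate_data and process_chunk are hypothetical stand-ins for whatever your two processes actually do. A real implementation would rather use shm_open/mmap with POSIX semaphores, and would add back-pressure so the producer stalls when too many unconsumed chunks pile up.

# process 1 (producer): write the output as 1 GB chunks into RAM-backed storage
OUT=/dev/shm/pipeline
mkdir -p "$OUT"
generate_data | split -b 1G -d - "$OUT/chunk."   # creates chunk.00, chunk.01, ...
touch "$OUT/ALL_DONE"                            # signal: no more chunks are coming

# process 2 (consumer): process each finished chunk and free the RAM immediately
OUT=/dev/shm/pipeline
i=0
while :; do
    chunk=$(printf '%s/chunk.%02d' "$OUT" "$i")
    next=$(printf '%s/chunk.%02d' "$OUT" $((i + 1)))
    # a chunk is known to be complete once the next one exists or ALL_DONE is set
    if [ -f "$next" ] || [ -f "$OUT/ALL_DONE" ]; then
        [ -f "$chunk" ] || break        # nothing left to read
        process_chunk "$chunk"
        rm -f "$chunk"                  # release the memory right away
        i=$((i + 1))
    else
        sleep 1                         # crude polling instead of a semaphore
    fi
done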
Whether you use mmap or plain read/write should make little difference; either way, everything happens through the filesystem cache/buffers. The big problem is NFS. The only way you can make this efficient is by storing the intermediate data locally on machine A rather than sending it all over the network to machine B only to pull it back again right afterwards.
Use tmpfs to leverage memory as (temporary) files.
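For example, assuming machine A has enough free RAM and you are able to mount filesystems, something along these lines would work (mount point and size are only illustrative):

# on machine A: create a RAM-backed scratch area; size is an upper bound,
# memory is only consumed as files are actually written
mkdir -p /mnt/ramscratch
mount -t tmpfs -o size=90g tmpfs /mnt/ramscratch

Then point process 1's output path and process 2's input path at /mnt/ramscratch instead of the NFS mount.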
Use mbuffer with netcat to simply relay from one port to another without storing the intermediate stream, but still allowing streaming to occur at varying speeds:
machine1:8001 -> machine2:8002 -> machine3:8003
On machine2, configure a job like:
netcat -l -p 8002 | mbuffer -m 2G | netcat machine3 8003
This allows at most 2 GB of data to be buffered. If the buffer fills completely, machine2 simply stops reading from machine1, delaying the output stream without failing.
When machine1 has completed its transmission, the second netcat will stay around until the mbuffer is depleted.
- You can use a RAM disk as storage.
- NFS is slow. Try an alternative way to transfer the data to the other machine, for example a plain TCP/IP stream (see the sketch after this list).
- Another solution: you can use an in-memory database (TimesTen, for example).
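As a rough sketch of the TCP/IP stream idea (hostname, port and paths are made up, and whether you need -q 0 or -N to close the connection at end-of-file depends on your netcat variant):

# on machine B: listen and write the incoming stream straight to disk
netcat -l -p 9000 > /data/bigfile

# on machine A: send the file, or pipe process 1's output in directly
# and skip the intermediate file altogether
netcat -q 0 machineB 9000 < bigfile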