I have a system I wish to distribute, involving a number of very large, non-splittable binary files that I want to process in a distributed fashion. Each is on the order of a couple of hundred GB. For a variety of fixed, implementation-specific reasons, these files cannot be processed in parallel; each has to be processed sequentially by the same process through to the end.
The application is developed in C++, so I would be considering Hadoop Pipes to stream the data in and out. Each instance will need to process on the order of 100GB to 200GB of its own data sequentially (currently stored in one file), and the application is currently (probably) I/O-limited, so it's important that each job runs entirely locally.
I'm very keen on HDFS for hosting this data - the ability to automatically maintain redundant copies and to rebalance as new nodes are added will be very useful. I'm also keen on MapReduce for its simplicity of computation and the way it places the computation as close as possible to the data. However, I'm wondering how suitable Hadoop is for this particular application.
I'm aware that, to represent my data, it's possible to generate non-splittable files, or alternatively to generate huge sequence files (in my case, on the order of 10TB for a single file, should I pack all my data into one), and that it's therefore possible to process my data using Hadoop. However, it seems like my model doesn't fit Hadoop that well: does the community agree? Or are there suggestions for laying this data out optimally? Or even for other cluster computing systems that might fit the model better?
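For reference, the kind of packing into a sequence file I had in mind is roughly the following. This is only a sketch: the paths, the offset-based key scheme, and the 64MB record size are placeholder assumptions, not decisions I've made, and the packing step would be a small Java utility since the SequenceFile writer lives in the Java API (the processing itself would stay in C++ via Pipes).

```java
import java.io.FileInputStream;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackIntoSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder paths: one local binary file packed into one sequence file on HDFS.
        Path out = new Path("/data/packed.seq");

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try (FileInputStream in = new FileInputStream("/local/huge-input.bin")) {
            byte[] buffer = new byte[64 * 1024 * 1024]; // arbitrary 64MB chunk per record
            long offset = 0;
            int read;
            while ((read = in.read(buffer)) > 0) {
                // Key each record by its byte offset so records can be read back in order.
                writer.append(new Text(Long.toString(offset)),
                              new BytesWritable(Arrays.copyOf(buffer, read)));
                offset += read;
            }
        } finally {
            writer.close();
        }
    }
}
```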
This question is perhaps a duplicate of existing questions on Hadoop, except that my system requires an order of magnitude or two more data per individual file (previously I've seen the question asked about individual files of a few GB in size). So forgive me if this has been answered before, even for this size of data.
Thanks,
Alex
It seems like you are working with a relatively small number of large files. Since your files are huge and not splittable, Hadoop will have trouble scheduling and distributing jobs effectively across the cluster. I think the more files you process in one batch (say, hundreds), the more worthwhile it will be to use Hadoop.
Since you're only working with a few files, have you tried a simpler distribution mechanism, like launching processes on multiple machines using ssh, or GNU Parallel? I've had a lot of success using this approach for simple tasks. Using an NFS-mounted drive shared across all your nodes also limits the amount of copying you would have to do.
You can write a custom InputSplit for your file, but as bajafresh4life said it won't really be ideal: unless your HDFS block size is the same as your file size, your files are going to be spread all around the cluster and there will be network overhead. And if you do make your HDFS block size match your file size, then you're not getting the benefit of all your cluster's disks. Bottom line: Hadoop may not be the best tool for you.
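If you do go down that road, you usually don't need a full custom InputSplit; the common trick is to override isSplitable() in an input format so each file becomes a single split and one map task reads it sequentially end to end. A sketch against the new mapreduce API, assuming the data is stored as sequence files as discussed above:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// Forces each input file into a single split, so one map task processes the
// whole file sequentially from start to finish.
public class NonSplittableSequenceFileInputFormat
        extends SequenceFileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
```

You'd then point the job at it with job.setInputFormatClass(NonSplittableSequenceFileInputFormat.class), but it doesn't change the underlying problem: the single map task still has to pull most of the file's blocks over the network unless the block size matches the file size.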