Hadoop, hardware and bioinformatics

Source: https://www.devze.com, 2023-02-12 03:37 (via web)
We're about to buy new hardware to run our analyses and are wondering if we're making the right decisions.

The setting:

We're a bioinformatics lab that will be handling DNA sequencing data. The biggest issue in our field is the amount of data rather than the compute. A single experiment will quickly run into the tens to hundreds of GB, and we would typically run different experiments at the same time. Obviously, MapReduce approaches are interesting (see also http://abhishek-tiwari.com/2010/08/mapreduce-and-hadoop-algorithms-in-bioinformatics-papers.html), but not all of our software uses that paradigm. Also, some software uses ASCII files as input/output, while other software works with binary files.

What we might be buying:

The machine we might be buying would be a server with 32 cores and 192 GB of RAM, linked to NAS storage (>20 TB). This seems a very interesting setup for many of our (non-MapReduce) applications, but would such a configuration prevent us from implementing Hadoop/MapReduce/HDFS in a meaningful way?

Many thanks,

jan.


You have an interesting configuration. What is the disk I/O throughput of the NAS storage you would be using?

Make your decision based on the following: the MapReduce paradigm exists to solve the problem of handling large amounts of data. RAM is more expensive than disk storage, so you cannot hold all the data in RAM; disks let you store large amounts of data cheaply, but the speed at which you can read data back from a single disk is limited. MapReduce solves this problem by distributing the data over multiple machines: the aggregate rate at which you can read in parallel is much greater than what a single storage disk can deliver. Suppose the disk I/O speed is 100 Mbps. With 100 machines reading in parallel, you can read the data at 100 × 100 Mbps = 10 Gbps.
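The back-of-the-envelope arithmetic above can be sketched in a few lines. This is a minimal illustration, not a benchmark: the per-disk throughput (100 MB/s) and the 500 GB dataset size are assumed figures chosen for the example, and it ignores network overhead and data skew.

```python
def scan_time_hours(dataset_gb, disks, mb_per_s_per_disk=100):
    """Hours needed to read a whole dataset when it is spread evenly
    across `disks` disks, each sustaining `mb_per_s_per_disk` MB/s of
    sequential read throughput (an assumed, illustrative figure)."""
    aggregate_mb_per_s = disks * mb_per_s_per_disk
    return (dataset_gb * 1024) / aggregate_mb_per_s / 3600

# Assumed 500 GB experiment: one disk vs. 100 disks read in parallel
print(f"1 disk:    {scan_time_hours(500, 1):.2f} h")
print(f"100 disks: {scan_time_hours(500, 100):.4f} h")
```

Reading scales linearly with the number of disks, which is the whole point of distributing the data: 100 disks cut the scan time by a factor of 100.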

Typically, processor speed is not the bottleneck; rather, disk I/O is the big bottleneck when processing large amounts of data.

That is why I have a feeling this setup may not be very efficient for Hadoop-style workloads.
