How efficient are open-source distributed computation frameworks like Hadoop? By efficiency, I mean the CPU cycles that can be used for the "actual job" in tasks that are mostly pure computation. In other words, how many CPU cycles are used for overhead, or wasted by not being used at all? I'm not looking for specific numbers, just a rough picture. E.g. can I expect to use 90% of the cluster's CPU power? 99%? 99.9%?
To be more specific, let's say I want to calculate pi, and I have an algorithm X. When I run this on a single core in a tight loop, I get some performance Y. If I do the same calculation in a distributed fashion using e.g. Hadoop, how much performance degradation can I expect?
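For concreteness, here's a minimal sketch of the single-core baseline I have in mind; Monte Carlo sampling is just a hypothetical stand-in for algorithm X:

```java
import java.util.Random;

// Single-core tight-loop baseline: estimate pi by Monte Carlo sampling.
// This is only a stand-in for "algorithm X" to make the comparison concrete.
public class PiBaseline {
    public static void main(String[] args) {
        long samples = 100_000_000L;
        long hits = 0;
        Random rng = new Random(42);
        long start = System.nanoTime();
        for (long i = 0; i < samples; i++) {
            double x = rng.nextDouble(), y = rng.nextDouble();
            if (x * x + y * y <= 1.0) hits++;   // point falls inside the unit quarter-circle
        }
        double pi = 4.0 * hits / samples;
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("pi ~= %f, %.0f samples/sec%n", pi, samples / secs);
    }
}
```

(For the distributed side, the Hadoop distribution ships a comparable Monte Carlo pi example in its bundled examples jar, which would make for a fairly direct comparison.)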
I understand this would depend on many factors, but what would be the rough magnitude? I'm thinking of a cluster with maybe 10 - 100 servers (80 - 800 CPU cores total), if that matters.
Thanks!
Technically, Hadoop has considerable overhead in several dimensions:
a) Per-task overhead, which can be estimated at roughly 1 to 3 seconds per task.
b) HDFS data-reading overhead, due to passing data through sockets and CRC calculation. This one is harder to estimate.
These overheads can be very significant if you have a lot of small tasks, and/or if your data processing is light.
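As a rough illustration (assuming 2 s of fixed per-task overhead, in the middle of the range above, and a hypothetical 10 s of useful work per task), the wasted fraction depends almost entirely on task granularity:

```java
// Back-of-envelope estimate of the CPU fraction lost to per-task overhead.
// The 2 s figure comes from the 1-3 s range above; the work times are hypothetical.
public class OverheadEstimate {
    public static void main(String[] args) {
        double perTaskOverheadSec = 2.0;            // assumed fixed cost per task
        double[] taskWorkSec = {10.0, 60.0, 600.0}; // hypothetical useful work per task
        for (double work : taskWorkSec) {
            double wasted = perTaskOverheadSec / (perTaskOverheadSec + work);
            System.out.printf("%.0f s tasks -> %.1f%% overhead%n", work, wasted * 100);
        }
        // Prints roughly: 16.7% for 10 s tasks, 3.2% for 60 s tasks, 0.3% for 600 s tasks.
    }
}
```

So the same framework can waste anywhere from well under 1% to double-digit percentages of CPU, depending purely on how long each task runs.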
At the same time, if you have big files (fewer tasks) and your data processing is heavy (say, a few MB/sec per core), then the Hadoop overhead can be neglected.
Bottom line: Hadoop's overhead is variable and depends highly on the nature of the processing you are doing.
This question is too broad and vague to answer usefully. There are many different open-source platforms, varying very widely in their quality. Some early Beowulfs were notoriously wasteful, for example, whereas modern MPI2 is pretty lean.
Also, "efficiency" means different things in different domains. It might mean the amount of CPU overhead spent on constructing and passing messages relative to the work payload (in which case you're comparing MPI vs Map/Reduce), or it might mean the number of CPU cycles wasted by the interpreter/VM, if any (in which case you're comparing C++ vs Python).
It depends on the problem you are trying to solve, too. In some domains, you have lots of little messages flying back and forth, in which case the CPU cost of constructing them matters a lot (like high-frequency trading). In others, you have relatively few but large work-blocks, so the cost of packing the messages is small compared to the computational efficiency of the math inside the work block (like Folding@Home).
So in summary, this is an impossible question to answer generally, because there's no one answer. It depends on specifically what you're trying to do with the distributed platform, and what machinery it is running on.
MapR is one of the alternatives to Apache Hadoop, and Srivas (CTO and founder of MapR) has compared MapR with Apache Hadoop. The presentation and video below have metrics comparing MapR and Apache Hadoop. It looks like the hardware is not used efficiently in Apache Hadoop.
http://www.slideshare.net/mcsrivas/design-scale-and-performance-of-maprs-distribution-for-hadoop
http://www.youtube.com/watch?v=fP4HnvZmpZI
Apache Hadoop seems to be inefficient in some dimensions, but there is a lot of activity in the Apache Hadoop community around scalability/reliability/availability/efficiency. Next Generation MapReduce and HDFS scalability/availability are some of the things currently being worked on. These should be available in Hadoop version 0.23.
Until some time back, the focus of the Hadoop community seemed to be on scalability, but it is now shifting towards efficiency as well.