Hadoop: Iterative MapReduce Performance

Is it correct to say that the parallel computation with iterative MapReduce can be justified mainly when the training data size is too large for the non-parallel computation for the sa...

I am aware that there is overhead for starting MapReduce jobs. This can be critical for the overall execution time when a large number of iterations is required.

I can imagine that, in many cases, sequential computation is faster than parallel computation with iterative MapReduce as long as the data set fits in memory.

Most of the time, no parallel processing system makes much sense if a single machine can do the job. The complexity associated with most parallelization tasks is significant, and you need a good reason to take it on.

Even when it's obvious that a task can't be resolved without parallel processing in acceptable time, parallel execution frameworks come in different flavours: from the more low-level, science-oriented tools like PVM or MPI to high-level, specialized (e.g. map/reduce) frameworks like Hadoop.

Among the parameters you should consider are startup time and scalability (how close to linearly the system scales). Hadoop will not be a good choice if you need answers quickly, but it might be a good choice if you can fit your process into a map-reduce frame.
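To make the startup-versus-scalability trade-off concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function names, the fixed per-job startup cost, and the single "efficiency" factor standing in for how close to linear the cluster scales are all my own assumptions, not anything measured on Hadoop.

```python
def sequential_time(iterations, work_per_iteration_s):
    """Total time when one machine runs every iteration in memory."""
    return iterations * work_per_iteration_s


def mapreduce_time(iterations, work_per_iteration_s, workers,
                   job_startup_s, efficiency=0.8):
    """Total time when each iteration is launched as a separate job.

    Assumption: every iteration pays the same fixed startup cost, and the
    work is divided by an effective speedup of workers * efficiency.
    """
    effective_speedup = workers * efficiency
    per_iteration = job_startup_s + work_per_iteration_s / effective_speedup
    return iterations * per_iteration


def break_even_work(workers, job_startup_s, efficiency=0.8):
    """Per-iteration sequential work (seconds) above which the cluster wins.

    Derived from: startup + w / s < w  =>  w > startup / (1 - 1 / s).
    """
    effective_speedup = workers * efficiency
    if effective_speedup <= 1:
        # No real speedup, so the startup cost is never recovered.
        return float("inf")
    return job_startup_s / (1 - 1 / effective_speedup)


if __name__ == "__main__":
    # Hypothetical numbers: 100 iterations, 10 workers, 30 s startup per job.
    for work in (5, 600):
        seq = sequential_time(100, work)
        par = mapreduce_time(100, work, workers=10, job_startup_s=30)
        print(f"work/iter={work}s sequential={seq}s mapreduce={par}s")
    print("break-even work/iter:", break_even_work(10, 30))
```

With these made-up numbers, an iteration that takes 5 seconds on one machine is far cheaper to run sequentially, because the startup cost dominates every pass. Only when a single iteration costs more than roughly the startup time does the cluster begin to pay off, and the number of iterations does not change that break-even point in this simple model.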

You may refer to project HaLoop ( http://code.google.com/p/haloop ) which addresses exactly this problem.