Is it correct to say that parallel computation with iterative MapReduce is mainly justified when the training data is too large for non-parallel computation of the same logic?
I am aware that there is overhead for starting MapReduce jobs. This can be critical for overall execution time when a large number of iterations is required.
I can imagine that, in many cases, sequential computation is faster than parallel computation with iterative MapReduce as long as the data set fits in memory.
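To make the per-iteration overhead concrete, here is a minimal sketch of an iterative MapReduce driver. It assumes hypothetical `IterationMapper` and `IterationReducer` classes and command-line arguments for the input path and iteration count; the point is that every iteration submits a brand-new Hadoop job, so the job startup cost is paid on every pass.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);               // initial data set on HDFS
        int iterations = Integer.parseInt(args[1]);   // number of passes

        for (int i = 0; i < iterations; i++) {
            Path output = new Path(args[0] + "_iter_" + (i + 1));

            Job job = Job.getInstance(conf, "iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(IterationMapper.class);   // hypothetical mapper
            job.setReducerClass(IterationReducer.class); // hypothetical reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);

            // Blocking call: task scheduling, JVM startup and HDFS writes of the
            // intermediate result are repeated on every single iteration.
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
            input = output; // the output of this pass becomes the input of the next
        }
    }
}
```

A sequential in-memory loop skips all of that per-iteration machinery, which is why it can win whenever the data fits on one machine.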
Most of the time, no parallel processing system makes much sense if a single machine can do the job. The complexity that comes with most parallelization is significant and needs a good reason to justify it.
Even when it's obvious that a task can't be solved in acceptable time without parallel processing, parallel execution frameworks come in different flavours: from the more low-level, science-oriented tools like PVM or MPI to high-level, specialized (e.g. map/reduce) frameworks like Hadoop.
Among the parameters you should consider are start time and scalability (how close to linear the system scales). Hadoop will not be a good choice if you need answers quickly, but might be a good choice if you can fit your process into a map/reduce frame.
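As a rough back-of-the-envelope model (with made-up numbers, not measurements): multiply the fixed per-job startup cost by the iteration count and compare it against what the cluster actually saves per iteration.

```java
public class BreakEven {
    public static void main(String[] args) {
        double jobStartupSec = 30.0;     // assumed per-job scheduling/startup overhead
        double seqIterationSec = 120.0;  // assumed single-machine time per iteration
        double speedup = 8.0;            // assumed near-linear speedup on the cluster
        int iterations = 50;

        double sequential = seqIterationSec * iterations;
        double parallel = (seqIterationSec / speedup + jobStartupSec) * iterations;

        // With these numbers the cluster still wins (2250 s vs 6000 s), but shrink the
        // per-iteration work or grow the startup cost and the gap closes quickly.
        System.out.printf("sequential: %.0f s, parallel: %.0f s%n", sequential, parallel);
    }
}
```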
You may refer to the HaLoop project ( http://code.google.com/p/haloop ), which addresses exactly this problem.