Parallel reducing with Hadoop MapReduce

I'm using Hadoop's MapReduce. I have a file as the input to the map function, and the map function does something with it (not relevant to the question). I'd like my reducer to take the map's output and write to two different files. The way I see it (I want an efficient solution), there are two ways to do this:

  1. One reducer that knows how to identify the different cases and writes to two different contexts.
  2. Two parallel reducers, each of which knows how to identify its relevant input, ignores the other's, and writes to its own file (each reducer writes to a different file).

I'd prefer the first solution, because it means I'd go over the map's output only once instead of twice in parallel - but if the first isn't supported in some way, I'll be glad to hear a solution for the second suggestion.

*Note: These two final files are supposed to stay separate; there is no need to join them at this point.


The Hadoop API has a feature for writing multiple outputs from a single reducer, called MultipleOutputs, which makes your preferred solution possible.
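
A minimal sketch of what that could look like, assuming Text keys and values; the named outputs "file1"/"file2" and the belongsToFirstFile routing check are hypothetical placeholders, not anything from your job:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TwoFileReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Decide per record which of the two files it belongs to.
            if (belongsToFirstFile(key, value)) {
                mos.write("file1", key, value);
            } else {
                mos.write("file2", key, value);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush both named outputs
    }

    // Hypothetical routing rule; replace with your real condition.
    private boolean belongsToFirstFile(Text key, Text value) {
        return true;
    }

    // In the driver, register the two named outputs before submitting the job.
    public static void configure(Job job) {
        MultipleOutputs.addNamedOutput(job, "file1", TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "file2", TextOutputFormat.class, Text.class, Text.class);
    }
}
```

The named outputs end up as separate files (e.g. file1-r-00000 and file2-r-00000) in the job's output directory, alongside the regular part files.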


If you know at the map stage which file each record must go to, you can tag your map output with a special key specifying which file it should go to. For example, if a record R1 must go to file 1, you would output <1, R1> (1 is the key, a symbolic representation of file 1, and R1 is the value). If a record R2 must go to file 2, your map output would be <2, R2>.

Then, if you configure the MapReduce job to use only 2 reducers, all records tagged with <1, _> are guaranteed to be sent to one reducer and all records tagged with <2, _> to the other.

This would be better than your preferred solution, since you still go through your map output only once, and at the same time the two files are written in parallel.
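
A rough sketch of the tagging idea, under the assumption that the records are Text lines read with the default input format; the goesToFileOne check is a hypothetical placeholder for the real routing rule:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaggingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final IntWritable FILE_ONE = new IntWritable(1);
    private static final IntWritable FILE_TWO = new IntWritable(2);

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Tag each record with the symbolic key of the file it should end up in.
        if (goesToFileOne(record)) {
            context.write(FILE_ONE, record);
        } else {
            context.write(FILE_TWO, record);
        }
    }

    // Hypothetical routing rule; replace with the real condition.
    private boolean goesToFileOne(Text record) {
        return record.getLength() % 2 == 0;
    }
}
```

In the driver you would call job.setNumReduceTasks(2); with two reduce tasks the default HashPartitioner happens to send key 1 and key 2 to different partitions, so each tag ends up in its own part file. If you want that split to be explicit rather than a side effect of hashing, you can plug in a custom Partitioner that maps key 1 to partition 0 and key 2 to partition 1.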
