I have a requirement that my mapper may, in some cases, produce a new key/value pair for another mapper to handle. Is there a sane way to do this? I've thought about writing my own custom input format (a queue?) to achieve this. Any ideas? Thanks!
EDIT: I should clarify
Method 1
Map Step 1:
  (foo1, bar1) -> out1
  (foo2, bar2) -> out2
  (foo3, bar3) -> (fooA, barA), (fooB, barB)
  (foo4, bar4) -> (fooC, barC)
Reduction Step 1:
  (out1) -> ok
  (out2) -> ok
  ((fooA, barA), (fooB, barB)) -> create Map Step 2
  ((fooC, barC)) -> also send this to Map Step 2
Map Step 2:
  (fooA, barA) -> out3
  (fooB, barB) -> (fooD, barD)
  (fooC, barC) -> out4
Reduction Step 2:
  (out3) -> ok
  ((fooD, barD)) -> create Map Step 3
  (out4) -> ok
Map Step 3:
  (fooD, barD) -> out5
Reduction Step 3:
  (out5) -> ok
  -- no more map steps. finished --
So it's fully recursive. Some key/values emit output for reduction, and some generate new key/values for mapping. I don't really know how many map or reduction steps I may encounter on a given run.
Method 2
Map Step 1:
  (foo1, bar1) -> out1
  (foo2, bar2) -> out2
  (foo3, bar3) -> (fooA, barA), (fooB, barB)
  (foo4, bar4) -> (fooC, barC)
  (fooA, barA) -> out3
  (fooB, barB) -> (fooD, barD)
  (fooC, barC) -> out4
  (fooD, barD) -> out5
Reduction Step 1:
  (out1) -> ok
  (out2) -> ok
  (out3) -> ok
  (out4) -> ok
  (out5) -> ok
This method would have the mapper feed its own input list. I'm not sure which way would be simpler to implement in the end.
The "Method 1" way of doing recursion through Hadoop forces you to run the full dataset through both Map and reduce for each "recursion depth". This implies that you must be sure how deep this can go AND you'll suffer a massive performance impact.
Can you say for certain that the recursion depth is limited?
If so, then I would definitely go for "Method 2" and build the mapper in such a way that it does the required recursion within a single mapper call. It's simpler and saves you a lot of performance.
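A minimal sketch of what such a mapper could look like, assuming hypothetical isTerminal() and expand() helpers that hold your domain logic:

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecursingMapper extends Mapper<Text, Text, Text, Text> {

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    // All recursion happens inside this single map() call.
    process(key.toString(), value.toString(), context);
  }

  private void process(String key, String value, Context context)
      throws IOException, InterruptedException {
    if (isTerminal(key, value)) {
      context.write(new Text(key), new Text(value));   // goes to the reducer
      return;
    }
    for (String[] kv : expand(key, value)) {            // derived pairs
      process(kv[0], kv[1], context);                   // handled recursively
    }
  }

  // Hypothetical, domain-specific helpers.
  private boolean isTerminal(String k, String v) { return true; }
  private List<String[]> expand(String k, String v) { return Collections.emptyList(); }
}
```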
Use Oozie (a grid workflow definition language) to string together two M/R jobs, with the first one having only a mapper. http://yahoo.github.com/oozie
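Whether Oozie or a plain Java driver does the orchestration, the first job simply runs with zero reducers. A rough sketch of the equivalent driver-side wiring, with hypothetical mapper/reducer class names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Stage 1: map-only job (no reducer); its output feeds stage 2.
    Job stage1 = Job.getInstance(conf, "stage 1 (map only)");
    stage1.setJarByClass(TwoStageDriver.class);
    stage1.setMapperClass(FirstMapper.class);      // hypothetical
    stage1.setNumReduceTasks(0);                   // mapper-only job
    FileInputFormat.addInputPath(stage1, new Path(args[0]));
    FileOutputFormat.setOutputPath(stage1, new Path(args[1]));
    if (!stage1.waitForCompletion(true)) System.exit(1);

    // Stage 2: full map + reduce over stage 1's output.
    Job stage2 = Job.getInstance(conf, "stage 2 (map + reduce)");
    stage2.setJarByClass(TwoStageDriver.class);
    stage2.setMapperClass(SecondMapper.class);     // hypothetical
    stage2.setReducerClass(SecondReducer.class);   // hypothetical
    FileInputFormat.addInputPath(stage2, new Path(args[1]));
    FileOutputFormat.setOutputPath(stage2, new Path(args[2]));
    System.exit(stage2.waitForCompletion(true) ? 0 : 1);
  }
}
```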
To the best of my understanding, the Hadoop MR framework plans at the beginning of the job which map tasks should be executed, and it is not prepared for new map tasks to appear dynamically.
I would suggest two possible solutions:
a) If you emit additional pairs during the map phase, feed them back to the same mapper. The mapper takes its usual arguments and, after processing them, looks into some kind of internal local queue for additional pairs to process. This works well if the sets of secondary pairs are small and data locality is not that important (see the first sketch after this list).
b) If you are indeed processing directories or something similar, you can iterate over the structure in the main() of the job and build all the splits you need right away (see the second sketch below).
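A rough sketch of option (a), with hypothetical needsExpansion() and expand() helpers standing in for the real logic:

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class QueueFeedingMapper extends Mapper<Text, Text, Text, Text> {

  // Internal local queue of secondary pairs generated while mapping.
  private final Deque<String[]> pending = new ArrayDeque<>();

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    handle(key.toString(), value.toString(), context);
    // After the primary pair, keep working until the local queue is empty.
    while (!pending.isEmpty()) {
      String[] kv = pending.pop();
      handle(kv[0], kv[1], context);
    }
  }

  private void handle(String key, String value, Context context)
      throws IOException, InterruptedException {
    if (needsExpansion(key, value)) {
      pending.addAll(expand(key, value));              // secondary pairs stay local
    } else {
      context.write(new Text(key), new Text(value));   // terminal output to reduce
    }
  }

  // Hypothetical, domain-specific helpers.
  private boolean needsExpansion(String k, String v) { return false; }
  private List<String[]> expand(String k, String v) { return Collections.emptyList(); }
}
```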
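And a rough sketch of option (b), registering every input up front in the driver; the directory walk and paths are illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class UpFrontInputDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "all inputs known up front");
    job.setJarByClass(UpFrontInputDriver.class);
    FileSystem fs = FileSystem.get(conf);
    addRecursively(fs, new Path(args[0]), job);
    // ... set mapper, reducer and output path as usual, then:
    // System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

  // Descend into every subdirectory and register each file as an input path,
  // so the set of map tasks is fully known before the job is submitted.
  private static void addRecursively(FileSystem fs, Path dir, Job job)
      throws Exception {
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isDirectory()) {
        addRecursively(fs, status.getPath(), job);
      } else {
        FileInputFormat.addInputPath(job, status.getPath());
      }
    }
  }
}
```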