Python Streaming: how to reduce to multiple outputs? (It's possible with Java, though)

Developer https://www.devze.com 2023-04-10 01:08 Source: Internet
I read Hadoop in Action and found that in Java, using the MultipleOutputFormat and MultipleOutputs classes, we can reduce the data to multiple files. What I am not sure about is how to achieve the same thing using Python streaming.

For example:

                  / out1/part-0000
mapper -> reducer
                  \ out2/part-0000

If anyone knows of, has heard of, or has done something similar, please let me know.


Dumbo Feathers, a set of Java classes to use together with Dumbo (a Python library that makes it easy to write efficient Python M/R programs for Hadoop), does this in its output classes.

Basically, in your Python Dumbo M/R job, you output a key that is a tuple of two elements: the first element is the name of the directory to output to, and the second is the actual key. The output class you've selected then inspects the tuple to find which output directory to use, and uses MultipleOutputFormat to write to different subdirectories.
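
As a rough sketch of the tuple-key idea (not the actual Feathers code), the reducer below yields ((output_dir, key), value) pairs in the Dumbo reducer style. The routing rule, the directory names "out1" and "out2", and the helper route_by_directory are assumptions made up for illustration; route_by_directory only imitates locally what the Java output class would do on the cluster.

```python
from collections import defaultdict


def reducer(key, values):
    """Dumbo-style reducer that yields ((output_dir, key), value) pairs."""
    total = sum(values)
    # Hypothetical routing rule: small totals go to "out1", the rest to "out2".
    output_dir = "out1" if total < 10 else "out2"
    yield (output_dir, key), total


def route_by_directory(pairs):
    """Local stand-in for the Java output class: group records by the
    first element of the tuple key (the target directory)."""
    outputs = defaultdict(list)
    for (output_dir, key), value in pairs:
        outputs[output_dir].append((key, value))
    return dict(outputs)


if __name__ == "__main__":
    grouped = {"apple": [1, 2], "pear": [7, 8]}
    pairs = [p for key, values in grouped.items() for p in reducer(key, values)]
    print(route_by_directory(pairs))
```

In a real job the reducer would be passed to Dumbo and the splitting into subdirectories would happen inside the selected output class, so only the shape of the emitted key matters on the Python side.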

With Dumbo, this is easy because it uses typedbytes as the output format, but I think it should be doable even with other output formats.
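
For plain text streaming without Dumbo, one possible approach is to encode the target directory as a prefix of the output key. This is only a sketch under assumptions: the "dir/key" encoding, the routing rule, and the function names are hypothetical, and a custom Java output format (for example, a subclass of MultipleTextOutputFormat) would still be needed on the Hadoop side to strip the prefix and pick the file.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split tab-separated 'key<TAB>count' lines into (key, int) pairs."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, int(value)


def reduce_stream(lines):
    """Sum the counts per key (input is assumed sorted by key, as in
    Hadoop streaming) and prefix each key with its target directory."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        total = sum(value for _, value in group)
        # Hypothetical routing rule, chosen only for this example.
        output_dir = "out1" if total < 10 else "out2"
        yield "%s/%s\t%d" % (output_dir, key, total)


def main():
    """Entry point for use as a streaming reducer script."""
    for record in reduce_stream(sys.stdin):
        print(record)
```

To use this as the reducer script, call main() at the bottom of the file; the Java output format then decides the subdirectory from the part of the key before the slash.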
