
Hadoop streaming: ensuring one key per reducer


I have a mapper that, while processing data, classifies output into 3 different types (type is the output key). My goal is to create 3 different csv files via the reducers, each with all of the data for one key with a header row.

The key values can change and are text strings.

Now, ideally, I would like to have 3 different reducers, with each reducer getting exactly one key along with its entire list of values.

Except this doesn't seem to work, because the keys don't get mapped to specific reducers.

The answer to this elsewhere has been to write a custom partitioner class that maps each desired key value to a specific reducer. That would be great, except that I need to use streaming with Python and I am not able to include a custom streaming jar in my job, so that doesn't seem to be an option.

I see in the Hadoop docs that there is an alternate partitioner class (KeyFieldBasedPartitioner) that can enable secondary sorts, but it isn't immediately obvious to me whether it is possible, using either the default or the key-field-based partitioner, to ensure that each key ends up on its own reducer without writing a Java class and using a custom streaming jar.

Any suggestions would be much appreciated.

Examples:

mapper output:

csv2\tfieldA,fieldB,fieldC
csv1\tfield1,field2,field3,field4
csv3\tfieldRed,fieldGreen
...
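
For concreteness, a minimal streaming mapper along these lines might look like the sketch below. The classify() logic is a hypothetical stand-in; the part that matters is emitting "key\tvalue" on stdout, since the text before the first tab is the key Hadoop partitions on:

    #!/usr/bin/env python
    # Minimal streaming-mapper sketch. classify() is hypothetical --
    # replace it with whatever logic decides which of the three csv
    # types a record belongs to.
    import sys

    def classify(record):
        # Hypothetical classification; replace with real logic.
        if record.startswith("red,") or record.startswith("green,"):
            return "csv3"
        if record.count(",") == 3:
            return "csv1"
        return "csv2"

    for line in sys.stdin:
        record = line.rstrip("\n")
        print("%s\t%s" % (classify(record), record))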

The problem is that with 3 reducers I end up with a key distribution like this:

reducer1        reducer2        reducer3
csv1            csv2
csv3

One reducer gets two different key types and one reducer gets no data sent to it at all, because hash(key csv1) mod 3 and hash(key csv2) mod 3 result in the same value.


I'm pretty sure MultipleOutputFormat [1] can be used under streaming. That'll solve most of your problems.

[1] http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html


If you are stuck with streaming, and can't include any external jars for a custom partitioner, then this is probably not going to work the way you want it to without some hacks.

If these are absolute requirements, you can get around this, but it's messy.

Here's what you can do:

Hadoop, by default, uses a hashing partitioner (HashPartitioner), which computes:

(key.hashCode() & Integer.MAX_VALUE) % numReducers

So you can pick keys whose hash codes land in three different buckets, i.e. three values x such that x % 3 gives 0, 1, and 2. This is a nasty hack, and I wouldn't suggest it unless you have no other options.
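
A minimal sketch of what that key hunting could look like, assuming the key is hashed like a Java String. (Hadoop's Text type actually hashes the raw UTF-8 bytes from a different seed, so verify bucket assignments against your cluster before relying on this.)

    def java_string_hash(s):
        """Java's String.hashCode(), with 32-bit signed overflow semantics."""
        h = 0
        for ch in s:
            h = (31 * h + ord(ch)) & 0xFFFFFFFF
        return h - 0x100000000 if h >= 0x80000000 else h

    def reducer_for(key, num_reducers=3):
        # Mirrors HashPartitioner: (hash & Integer.MAX_VALUE) % numReducers.
        return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

    # Try candidate key spellings (e.g. by appending a disambiguating
    # suffix) until each of the three keys lands in a different bucket.
    for key in ("csv1", "csv2", "csv3"):
        print("%s -> reducer %d" % (key, reducer_for(key)))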


If you want custom output to different csv files, you can write directly to HDFS (via the filesystem API). As you know, Hadoop passes each key with its associated list of values to a single reduce task. In your reduce code, keep writing to the same file while the key stays the same; when a new key arrives, create a new file manually and write into it. Then it does not matter how many reducers you have.
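
A minimal reducer sketch along those lines, relying on streaming's guarantee that all lines for a key arrive contiguously. The file names and header row are hypothetical, and the files are written locally here; pushing them to HDFS (e.g. with hadoop fs -put) would be a separate step rather than the direct API write described above:

    #!/usr/bin/env python
    # Minimal reducer sketch: roll over to a new csv file whenever the
    # key changes. Streaming delivers all lines for a key contiguously,
    # so a simple "did the key change?" check is enough.
    import sys

    current_key = None
    out = None

    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if out is not None:
                out.close()
            current_key = key
            out = open("%s.csv" % key, "w")
            out.write("col1,col2,col3\n")  # hypothetical header row
        out.write(value + "\n")

    if out is not None:
        out.close()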

