Which key class is suitable for secondary sort?

Developer (https://www.devze.com), 2023-01-07 17:02. Source: the web

In Hadoop you can use the secondary-sort mechanism to sort the values before they are sent to the reducer.

The way this is done in Hadoop is that you add the value to sort by to the key and then supply custom grouping and key comparison methods that hook into the sorting system.

So you'll need a key that essentially consists of both the real key and the value to sort by. To make this perform fast enough, I need a way of creating a composite key that is also easy to decompose into the separate parts needed for the grouping and key comparison methods.
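To make the idea concrete, here is a minimal sketch in plain Java with no Hadoop dependency. CompositeKey, SecondarySortSketch and sortAndGroup are hypothetical names made up for illustration. The sketch only simulates what the shuffle does: sort on the full composite key, then group on the real key alone. In a real job, the two comparators would be registered with Job.setSortComparatorClass and Job.setGroupingComparatorClass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical composite key: the real key plus the value to sort by.
final class CompositeKey {
    final String naturalKey; // the real key, used for grouping
    final int sortValue;     // the value the reducer should receive in order

    CompositeKey(String naturalKey, int sortValue) {
        this.naturalKey = naturalKey;
        this.sortValue = sortValue;
    }
}

final class SecondarySortSketch {
    // Sort order: real key first, then the secondary value.
    static final Comparator<CompositeKey> SORT_ORDER =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                      .thenComparingInt(k -> k.sortValue);

    // Grouping: only the real key decides which records belong together.
    static final Comparator<CompositeKey> GROUPING =
            Comparator.comparing((CompositeKey k) -> k.naturalKey);

    // Simulates the shuffle: sort everything, then cut the sorted run into groups.
    static List<List<CompositeKey>> sortAndGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_ORDER);
        List<List<CompositeKey>> groups = new ArrayList<>();
        for (CompositeKey key : sorted) {
            boolean startsNewGroup = groups.isEmpty()
                    || GROUPING.compare(groups.get(groups.size() - 1).get(0), key) != 0;
            if (startsNewGroup) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
        }
        return groups;
    }
}
```

Each inner list corresponds to one reduce call, and the keys inside it are already ordered by sortValue.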

What is the smartest way to do this? Is there an "out-of-the-box" Hadoop class that can assist me in this, or do I have to create a separate key class for each map-reduce step?

How do I do this if the key actually is a composite that consists of several parts (also needed separately because of the partitioner)?
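For the multi-part case, one possible approach is to partition on the real key parts only and leave the sort field out of the hash. The sketch below is plain Java under that assumption. MultiPartKey, its fields and NaturalKeyPartitioning are hypothetical names. The formula in partitionFor is the one Hadoop's default HashPartitioner applies to a key's hash code. In a real job this logic would live in a Partitioner subclass registered with Job.setPartitionerClass.

```java
import java.util.Objects;

// Hypothetical key whose real key has two parts, plus a value to sort by.
final class MultiPartKey {
    final String customer; // part 1 of the real key
    final String region;   // part 2 of the real key
    final long timestamp;  // secondary-sort field, must not influence partitioning

    MultiPartKey(String customer, String region, long timestamp) {
        this.customer = customer;
        this.region = region;
        this.timestamp = timestamp;
    }

    // Hash of the real key parts only, so all records of one group
    // end up in the same partition regardless of the sort field.
    int naturalKeyHash() {
        return Objects.hash(customer, region);
    }
}

final class NaturalKeyPartitioning {
    // Masks the sign bit so the result is never negative, then takes the
    // remainder to pick a partition in the range [0, numPartitions).
    static int partitionFor(MultiPartKey key, int numPartitions) {
        return (key.naturalKeyHash() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Two keys that differ only in timestamp always land in the same partition, which is what the grouping step relies on.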

What do you guys recommend?

P.S. I wanted to add the tag "secondary-sort" but I don't have enough rep yet to do so.


I was running into this situation all the time and got tired of writing custom composite key classes, so I wrote a generic Tuple class, which is a list of objects and can act as a composite key. The list may contain an arbitrary number of objects of Java primitive wrapper types. It implements WritableComparable. The source can be viewed here:

https://github.com/pranab/chombo/blob/master/src/main/java/org/chombo/util/Tuple.java


I am not able to understand the question. I do have a working SecondarySort example, which prints the max value from the list of values.

https://github.com/kapild/hadoop-examples/tree/master/src/SecondarySort


You need to change the way keys are partitioned and grouped. This basically means that you put more than one data type in the key, while overriding the partitioner and the grouping comparator.

- You can serialize/deserialize your keys and deal with input data as objects or beans if you want strongly typed, robust code for secondary sorting.

- For simpler scenarios, just put a "#" sign between the values!
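As a rough sketch of the "#" approach in plain Java: DelimitedKey and its methods are hypothetical names, and the sort value is assumed to be a non-negative int. One thing to watch is that text compares character by character, so "10" sorts before "9". The sketch zero-pads the number to ten digits to keep text order and numeric order in agreement.

```java
import java.util.Locale;

final class DelimitedKey {
    static final char SEPARATOR = '#';

    // Joins the real key and the sort value into one text key.
    static String compose(String naturalKey, int sortValue) {
        if (naturalKey.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("real key must not contain " + SEPARATOR);
        }
        if (sortValue < 0) {
            throw new IllegalArgumentException("sort value must be non-negative");
        }
        // Ten digits is enough for any non-negative int.
        return naturalKey + SEPARATOR + String.format(Locale.ROOT, "%010d", sortValue);
    }

    // Everything before the separator: the part to partition and group on.
    static String naturalKey(String composite) {
        return composite.substring(0, composite.indexOf(SEPARATOR));
    }

    // Everything after the separator, parsed back into a number.
    static int sortValue(String composite) {
        return Integer.parseInt(composite.substring(composite.indexOf(SEPARATOR) + 1));
    }
}
```

The price of this shortcut is that every comparison has to split or parse text, which is the part a typed key class avoids.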

There is a great high-level article on this here:

http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/


I had a situation in which I had to sort data on two columns, one of string type and the other of integer type. I wrote a custom WritableComparable and put my logic in its compareTo method. From my point of view this is the best way, as it lets us customize the sorting logic.
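A key like that might look roughly like the sketch below. This is not the poster's actual class. StringIntKey and its field names are hypothetical, and to keep the example compilable without Hadoop on the classpath it implements plain Comparable. The write and readFields methods use the same signatures as Hadoop's Writable interface, so under that assumption the class could declare WritableComparable<StringIntKey> instead.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical two-column key: a string column and an integer column.
final class StringIntKey implements Comparable<StringIntKey> {
    private String name = "";
    private int number;

    // No-arg constructor: Hadoop creates key instances reflectively
    // and then fills them in through readFields.
    StringIntKey() {
    }

    StringIntKey(String name, int number) {
        this.name = name;
        this.number = number;
    }

    String getName() {
        return name;
    }

    int getNumber() {
        return number;
    }

    // Serializes both columns in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(number);
    }

    // Reads the columns back in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        number = in.readInt();
    }

    // Custom sort logic: string column first, integer column as tie-breaker.
    @Override
    public int compareTo(StringIntKey other) {
        int byName = name.compareTo(other.name);
        return byName != 0 ? byName : Integer.compare(number, other.number);
    }
}
```

Changing the sort rule, for example to descending numbers, only means touching compareTo.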
