I want to sort a big dataset efficiently (i.e. with a custom partitioner, as described here: How does the MapReduce sort algorithm work?), but I want to do it with Hive.
However, the Hive manual states that "order by" is performed by a single reducer. This surprises me, as Pig does implement something similar to what the article describes - pig impl
Am I missing something, or is it that Hive simply isn't the right hammer for this job?
I think Hive is not the right tool for the job, at least for now. It is built to be used as an OLAP/reporting tool and is therefore not optimized to produce large result datasets, since most analytical queries produce relatively small result sets. As a result, it has good TOP N capability but not good total ordering.
In case you haven't encountered it before, I suggest looking into Hadoop's terasort example, which is specifically aimed at sorting a large dataset as efficiently as possible using MR. http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/examples/terasort/package-summary.html
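To make the idea behind a sampled range partitioner concrete, here is a minimal in-memory sketch in Python. It is not TeraSort's actual code, and the function names (sample_split_points, range_partition, total_order_sort) are made up for illustration. It assumes the keys fit in memory and are mutually comparable; in a real job each partition would be sorted by a separate reducer.

```python
import bisect
import random

def sample_split_points(keys, num_reducers, sample_size=100, seed=0):
    """Pick num_reducers - 1 split points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced quantiles of the sample become the partition boundaries.
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]

def range_partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)

def total_order_sort(keys, num_reducers=4):
    """Simulate a total-order sort: range-partition, sort each part, concatenate."""
    if not keys:
        return []
    split_points = sample_split_points(keys, num_reducers)
    partitions = [[] for _ in range(num_reducers)]
    for key in keys:  # "map" side: route each key by range, not by hash
        partitions[range_partition(key, split_points)].append(key)
    # Each "reducer" sorts only its own key range; these sorts are independent.
    sorted_parts = [sorted(part) for part in partitions]
    # Reducer i only holds keys below those of reducer i + 1, so concatenating
    # the outputs in reducer order gives a globally sorted result.
    return [key for part in sorted_parts for key in part]
```

Because the partition index grows with the key, no merge step is needed at the end, and sampling only serves to keep the partitions roughly equal in size.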
It is not possible to use multiple reducers for total ordering in Hive. It has not been implemented yet: https://issues.apache.org/jira/browse/HIVE-1402
If you want efficient total ordering, it will be easier to use Pig than to write a custom MR job.
Hive generates MapReduce job(s) for executing the queries. In your particular case the actual sorting is done by the Hadoop MapReduce framework before the data is fed into the reducer.
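As a rough illustration of that shuffle step, the Python sketch below simulates a job with an identity reducer. The run_job function is hypothetical, and the modulo partitioner is only a stand-in for a hash-style partitioner; it assumes integer keys. The point is that the framework's sort is per reducer, so the result is totally ordered only when there is a single reducer or the partitioner respects key order.

```python
from collections import defaultdict

def run_job(records, num_reducers):
    """Simulate map -> shuffle/sort -> reduce with an identity reducer.

    Returns one output list per reducer, in reducer order.
    """
    buckets = defaultdict(list)
    for key, value in records:
        # Hash-style partitioning: the reducer is chosen without regard to key order.
        buckets[key % num_reducers].append((key, value))
    outputs = []
    for reducer in range(num_reducers):
        # The framework sorts each reducer's input by key before reduce() sees it.
        outputs.append(sorted(buckets[reducer], key=lambda kv: kv[0]))
    return outputs

records = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c")]
one_reducer = run_job(records, 1)   # a single output, sorted over all keys
two_reducers = run_job(records, 2)  # two outputs, each sorted only within itself
```

With one reducer the output is globally sorted, which matches what the manual says about "order by". With two reducers each output is sorted, but reading them back to back does not give a total order.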