In summary: I feel that my system is ignoring the concept of pre-sorted tables. - I expected to save time on the sorting step because I was using pre-sorted data, but the query plan seem to indicate an intermediate sorting step.
The sordid details follow:
The Setup =======
I have set the following flags:=============
set hive.enforce.bucketing = true;
set mapred.reduce.tasks=8;
set mapred.map.tasks=8;
Here I create a table to hold a temporary copy of data on disk ========
CREATE TABLE trades
(symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Here I copy the data on disk into the table BTW, the data here is clustered by symbol and sorted by time. I can't seem to get Hive to use this concept ... i.e avoid sorting again
LOAD DATA LOCAL INPATH '%(dir)s2010-05-07'
INTO TABLE trades
partition (dt='2010-05-07');
I use the following final table to enforce bucketing =========== and to impose sort order ===========
CREATE TABLE alltrades
(symbol STRING, exchange STRING, price FLOAT, volume INT, cond
INT, bid FLOAT, ask FLOAT, time STRING)
CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
data is loaded from a hive table ==========
insert overwrite table alltrades
select symbol, exchange, price, volume, cond, bid, ask, time
from trades
distribute by symbol sort by symbol, time;
It is disappointing to see that any query on alltrades that requires sorted symbol,time does the sorting all over again ... is there a way around this? Also, is there a way to make this whole process work in 1 query step instead of 2 ?
Why SORTING seems to not work =======
Note that the table was constructed and populated with the sort by clause. I'm afraid that dropping these would lead a future reducer to behave as if no sorting was required.
Here is the plan for a query that in my opinion should not involve sorting ... but actually does.========
hive> explain select symbol, time, price from alltrades sort by symbol, time;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades)))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL
time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol))
(TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time)))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
alltrades
TableScan
alias: alltrades
Select Operator
开发者_Python百科 expressions:
expr: symbol
type: string
expr: time
type: string
expr: price
type: float
outputColumnNames: _col0, _col1, _col2
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
sort order: ++
tag: -1
value expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: float
Reduce Operator Tree:
Extract
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
Have you checked out the effect of set hive.enforce.bucketing=true
? From http://svn.apache.org/repos/asf/hive/branches/branch-0.7/conf/hive-default.xml
<property>
<name>hive.enforce.sorting</name>
<value>false</value>
<description>Whether sorting is enforced. If true, while inserting into the table, sorting is enforced. </description>
</property>
You may also find reading the implementation of org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer#genBucketingSortingDest
useful:
http://svn.apache.org/repos/asf/hive/branches/branch-0.7/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
hive.enforce.bucketing
does not do a global sort of the data set. Instead it writes the data sorted within the buckets (in your case 8/partition). Thus it requires a global sort step to satisfy the query you are looking for.
Hope this helps, Nat
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
Also look at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
精彩评论