Is a collocated join (a-la-netezza) theoretically possible in hive?

Developer https://www.devze.com 2023-03-26 15:46 Source: Internet
When you join tables that are distributed on the same key and use those key columns in the join condition, each SPU (machine) in Netezza works 100% independently of the others (see nz-interview).

In Hive there is the bucketed map join, but distributing the files that represent the tables across datanodes is the responsibility of HDFS; it is not done according to the Hive CLUSTERED BY key!
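To make the bucketing side of this concrete, here is a minimal Python sketch of why a bucketed map join only needs to pair bucket i of one table with bucket i of the other. It assumes an integer key that hashes to itself and a mask-then-modulo bucket rule; Hive's real hash depends on the column type and version, and the names `bucket_for` and `bucketize` are made up for illustration. Note that nothing in it says which datanode a bucket file ends up on, which is exactly the gap described above.

```python
# Illustrative only: models CLUSTERED BY bucketing, not Hive's exact hash.

def bucket_for(key: int, num_buckets: int) -> int:
    # Assumption: an integer key hashes to itself; the mask keeps it non-negative.
    return (key & 0x7FFFFFFF) % num_buckets


def bucketize(keys, num_buckets):
    # Group keys into bucket "files", one list per bucket index.
    buckets = {i: [] for i in range(num_buckets)}
    for k in keys:
        buckets[bucket_for(k, num_buckets)].append(k)
    return buckets


NUM_BUCKETS = 4
big = bucketize([1, 2, 5, 6, 9, 13], NUM_BUCKETS)
small = bucketize([1, 5, 6, 13], NUM_BUCKETS)

# Join bucket i of the big table only against bucket i of the small table.
matches_per_bucket = {
    i: sorted(set(big[i]) & set(small[i])) for i in range(NUM_BUCKETS)
}
print(matches_per_bucket)  # {0: [], 1: [1, 5, 13], 2: [6], 3: []}
```

Because both tables use the same key and the same number of buckets, every matching key is found by comparing buckets with equal index; whether those two bucket files are on the same machine is a separate question that HDFS decides.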

So suppose I have 2 tables, CLUSTERED BY the same key, and I join on that key. Can Hive get a guarantee from HDFS that matching buckets will sit on the same node? Or will it always have to move the matching bucket of the small table to the datanode containing the big table's bucket?

Thanks, ido

(note: this is a better phrasing of my previous question: How does hive/hadoop assures that each mapper works on data that is local for it?)


I don't think it is possible to tell HDFS where to store blocks of data.
One trick that can work on small clusters is to raise the replication factor of one of the tables to a number close or equal to the number of nodes in the cluster.
As a result, during the join the required data will almost always (or always) be present on the node that needs it.
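A rough way to see how much the replication trick helps is to estimate the chance that the node running the join already holds a replica of the small table's bucket. The sketch below is a simplification under stated assumptions: each bucket is a single block, its replicas land on distinct nodes chosen uniformly at random (real HDFS placement is rack-aware), and the task runs on one given node. The function names are hypothetical.

```python
import random
from fractions import Fraction


def local_hit_probability(num_nodes: int, replication: int) -> Fraction:
    # Under uniform placement, a given node holds one of r replicas
    # with probability r / N (capped at 1 when r >= N).
    r = min(replication, num_nodes)
    return Fraction(r, num_nodes)


def simulate_local_hits(num_nodes: int, replication: int,
                        trials: int = 10_000, seed: int = 0) -> float:
    # Monte Carlo check of the formula above under the same assumptions.
    rng = random.Random(seed)
    r = min(replication, num_nodes)
    hits = 0
    for _ in range(trials):
        replica_nodes = rng.sample(range(num_nodes), r)  # distinct nodes
        task_node = rng.randrange(num_nodes)             # where the join runs
        if task_node in replica_nodes:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for r in (3, 5, 10):
        print(r, local_hit_probability(10, r), simulate_local_hits(10, r))
```

On a 10-node cluster this gives 3/10 for a replication factor of 3 and reaches 1 only when the replication factor equals the node count, which matches the "almost always (or always)" wording above. The cost is storage: the replicated table takes roughly that many times its size on disk, so this is only practical for the small side of the join.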
