I'm working on parsing a large dataset whose records have a primary key and a secondary key:
- Primary Key
- Secondary Key
- Additional fields
The primary-to-secondary mapping is one-to-many (primary being the 'one'). I'd like to include the number of unique secondaries per primary in my output. There won't be more than a few thousand secondaries per primary at most.
I can think of two ways of doing this:
- Define a custom Writable type for the map output which contains a set (hashtable, list, whatever) of the unique secondaries. Perform everything in a single map/reduce cycle, where the reducer does a union on the sets of secondary keys (a rough sketch of such a Writable is below, after this list).
- Perform the primary/secondary counting as its own job and consume its output in a second job.
The former might run into some size issues with the output (where the set of keys can get large-ish). The latter will require iterating over the source data twice.
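For reference, here is a minimal sketch of what I have in mind for the first option, assuming the secondary keys are strings (the class name and field layout are just placeholders):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.Writable;

// Hypothetical map-output value: the set of unique secondary keys seen for one primary.
public class SecondarySetWritable implements Writable {

    private final Set<String> secondaries = new HashSet<>();

    public Set<String> get() {
        return secondaries;
    }

    public void add(String secondary) {
        secondaries.add(secondary);
    }

    // Union with another set, e.g. when merging in the reducer.
    public void addAll(SecondarySetWritable other) {
        secondaries.addAll(other.secondaries);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(secondaries.size());
        for (String s : secondaries) {
            out.writeUTF(s);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        secondaries.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            secondaries.add(in.readUTF());
        }
    }
}
```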
Can anyone advise on what the better approach would be here?
I'm also considering using Hive -- maybe it makes more sense to generate a table which contains all this data and doing the grouping with Hive requests?
I would try the straightforward approach: emit "primary"-"secondary" key-value pairs from the mapper and count the unique secondary keys in the reducer. With at most a few thousand secondary values per primary, that should not be a problem. At the same time, if the dataset comes from an external source you can expect some data irregularities, so it would be better to put a cap on the number of unique secondaries processed in the reducer - for example, stop processing when 100K values are reached and report an error.
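A rough sketch of that, assuming tab-separated text input with the primary key in the first column and the secondary key in the second (field positions and class names are just placeholders):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (primary, secondary) pairs; assumes tab-separated input records.
class PrimarySecondaryMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text primary = new Text();
    private final Text secondary = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        if (fields.length >= 2) {
            primary.set(fields[0]);
            secondary.set(fields[1]);
            context.write(primary, secondary);
        }
    }
}

// Counts unique secondaries per primary, with a cap to guard against bad data.
class UniqueSecondaryReducer extends Reducer<Text, Text, Text, IntWritable> {
    private static final int MAX_UNIQUE = 100_000;

    @Override
    protected void reduce(Text primary, Iterable<Text> secondaries, Context context)
            throws IOException, InterruptedException {
        Set<String> unique = new HashSet<>();
        for (Text s : secondaries) {
            unique.add(s.toString());
            if (unique.size() >= MAX_UNIQUE) {
                // Data irregularity: far more secondaries than expected for this primary.
                context.getCounter("DataQuality", "CappedPrimaries").increment(1);
                break;
            }
        }
        context.write(primary, new IntWritable(unique.size()));
    }
}
```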
If you expect good locality of the primary values within a partition, a combiner will do a good job of reducing the intermediate data.
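For example, a combiner could simply de-duplicate the (primary, secondary) pairs within each map task's output. Again just a sketch; it keeps the same Text/Text signature so the reducer still sees plain pairs:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner: emits each distinct (primary, secondary) pair once per map task,
// cutting the amount of intermediate data shuffled to the reducers.
class DedupCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text primary, Iterable<Text> secondaries, Context context)
            throws IOException, InterruptedException {
        Set<String> seen = new HashSet<>();
        for (Text s : secondaries) {
            if (seen.add(s.toString())) {
                context.write(primary, s);
            }
        }
    }
}
```

It would be wired in with `job.setCombinerClass(DedupCombiner.class)`.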
Regarding Hive: if the data is in a suitable format, it definitely makes sense to give it a try. I have had cases where we planned serious MR optimizations and then found that Hive did a good enough job on its own.