I'm working on parsing a large dataset whose records have a primary key and a secondary key:
- Primary Key
- Secondary Key
- Additional fields
The primary-to-secondary mapping is one-to-many (primary being the 'one'). I'd like to include the number of unique secondaries per primary in my output. There won't be more than a few thousand secondaries per primary at most.
I can think of two ways of doing this:
- Define a custom Writable type for the map output which contains a set (hashtable, list, whatever) of the unique secondaries. Perform everything in a single map/reduce cycle, where the reducer does a union on the sets of secondary keys (a rough sketch of such a Writable is below, after this list).
- Perform the primary/secondary counting as its own job and consume its output in a second job.
The former might run into some size issues with the output (where the set of keys can get large-ish). The latter will require iterating over the source data twice.
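For reference, here is a minimal sketch of what I have in mind for the first option, assuming the secondary keys are strings (the class name and field layout are just placeholders):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.Writable;

// Hypothetical map-output value: the set of unique secondary keys seen for one primary.
public class SecondarySetWritable implements Writable {

    private final Set<String> secondaries = new HashSet<>();

    public Set<String> get() {
        return secondaries;
    }

    public void add(String secondary) {
        secondaries.add(secondary);
    }

    // Union with another set, e.g. when merging in the reducer.
    public void addAll(SecondarySetWritable other) {
        secondaries.addAll(other.secondaries);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(secondaries.size());
        for (String s : secondaries) {
            out.writeUTF(s);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        secondaries.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            secondaries.add(in.readUTF());
        }
    }
}
```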
Can anyone advise on what the better approach would be here?
I'm also considering using Hive -- maybe it makes more sense to generate a table which contains all this data and doing the grouping with Hive requests?
I would try the straightforward approach: emit "primary"-"secondary" key-value pairs from the mapper and count the unique secondary keys in the reducer. With at most a few thousand secondary values per primary, that should not be a problem. At the same time, if the dataset comes from an external source you can expect some data irregularities, so it would be better to put a cap on the number of unique secondaries processed in the reducer - for example, stop processing when 100K values are reached and report an error.
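A rough sketch of that, assuming tab-separated text input with the primary key in the first column and the secondary key in the second (field positions and class names are just placeholders):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (primary, secondary) pairs; assumes tab-separated input records.
class PrimarySecondaryMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text primary = new Text();
    private final Text secondary = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        if (fields.length >= 2) {
            primary.set(fields[0]);
            secondary.set(fields[1]);
            context.write(primary, secondary);
        }
    }
}

// Counts unique secondaries per primary, with a cap to guard against bad data.
class UniqueSecondaryReducer extends Reducer<Text, Text, Text, IntWritable> {
    private static final int MAX_UNIQUE = 100_000;

    @Override
    protected void reduce(Text primary, Iterable<Text> secondaries, Context context)
            throws IOException, InterruptedException {
        Set<String> unique = new HashSet<>();
        for (Text s : secondaries) {
            unique.add(s.toString());
            if (unique.size() >= MAX_UNIQUE) {
                // Data irregularity: far more secondaries than expected for this primary.
                context.getCounter("DataQuality", "CappedPrimaries").increment(1);
                break;
            }
        }
        context.write(primary, new IntWritable(unique.size()));
    }
}
```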
If you expect good locality of the primary values within a partition, a combiner will do a good job of reducing the intermediate data.
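For example, a combiner could simply de-duplicate the (primary, secondary) pairs within each map task's output. Again just a sketch; it keeps the same Text/Text signature so the reducer still sees plain pairs:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner: emits each distinct (primary, secondary) pair once per map task,
// cutting the amount of intermediate data shuffled to the reducers.
class DedupCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text primary, Iterable<Text> secondaries, Context context)
            throws IOException, InterruptedException {
        Set<String> seen = new HashSet<>();
        for (Text s : secondaries) {
            if (seen.add(s.toString())) {
                context.write(primary, s);
            }
        }
    }
}
```

It would be wired in with `job.setCombinerClass(DedupCombiner.class)`.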
Regarding Hive: if the data is in a suitable format, it definitely makes sense to give it a try. I have had cases where we planned serious MR optimizations and then found that Hive did a good enough job on its own.