I have two data sets: one is historical quote data and the other is historical trade data. The data is split per symbol per day. My question is how to load two files for the same symbol in the same map function. For example, I want to process the 2011-01-27 IBM quotes file and the IBM trades file for the same date simultaneously. How do I configure Hadoop to do this? I have read about MultipleFileReader, but it does not let us choose specific files to load together.
Thanks, Ankush
Output a <$date-$symbol, $data> pair in your map function, where $date-$symbol is a compound key with the date and symbol concatenated together, and where $data is either quote data or trade data. Hadoop will group together all pairs that share the same key, and you can then process the data in the reduce() function.
The reducer will need some logic to distinguish quote data from trade data, depending on how you're serializing that data.
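The compound-key idea can be sketched in plain Python, without any Hadoop API. This is only an in-process simulation of map, shuffle, and reduce. The "Q"/"T" tags, the function names, and the sample rows are all made up for illustration. In a real job the tag would have to be derived from the input somehow, for instance from the file the record came from.

```python
from collections import defaultdict

def map_record(source, date, symbol, payload):
    """Emit (compound key, tagged value); the tag says which data set the row came from."""
    return (f"{date}-{symbol}", (source, payload))

def shuffle(pairs):
    """Stand-in for Hadoop's shuffle phase: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_group(key, values):
    """Split the grouped values back into quotes and trades using the tag."""
    quotes = [payload for tag, payload in values if tag == "Q"]
    trades = [payload for tag, payload in values if tag == "T"]
    return key, quotes, trades

# Hypothetical sample rows: (tag, date, symbol, payload)
records = [
    ("Q", "2011-01-27", "IBM", "bid=161.0 ask=161.1"),
    ("T", "2011-01-27", "IBM", "price=161.05 size=100"),
    ("Q", "2011-01-27", "MSFT", "bid=28.8 ask=28.9"),
]

grouped = shuffle(map_record(*r) for r in records)
results = {k: reduce_group(k, v)[1:] for k, v in grouped.items()}
```

Here results maps each date-symbol key to a (quotes, trades) pair, so the IBM quotes and trades for 2011-01-27 end up in the same reduce call.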
While you can do it the way described above, you can also create a text file containing the names of the files from both data sets and use it as the input to the job. You can build it automatically by scanning the HDFS tree. The main drawback of this solution is that you lose data locality, so most of the data will travel over the network.
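As a rough sketch of building such a file list: the code below pairs quote and trade paths by date and symbol, and writes one line per pair. The directory layout (quotes/<date>/<symbol>.csv and trades/<date>/<symbol>.csv) and the tab-separated line format are assumptions, not something given in the question. The listing itself would come from scanning HDFS, which is not shown here.

```python
import re
from collections import defaultdict

# Hypothetical layout: <dataset>/<date>/<symbol>.csv
PATH_RE = re.compile(r"^(quotes|trades)/(\d{4}-\d{2}-\d{2})/([^/]+)\.csv$")

def build_manifest(paths):
    """Return one 'quotes_path<TAB>trades_path' line per (date, symbol) found in both data sets."""
    pairs = defaultdict(dict)
    for path in paths:
        match = PATH_RE.match(path)
        if match is None:
            continue  # skip anything that does not fit the assumed layout
        dataset, date, symbol = match.groups()
        pairs[(date, symbol)][dataset] = path

    lines = []
    for key in sorted(pairs):
        entry = pairs[key]
        # Keep only complete pairs; a lone quotes or trades file is dropped.
        if "quotes" in entry and "trades" in entry:
            lines.append(f"{entry['quotes']}\t{entry['trades']}")
    return lines

# Hypothetical directory listing
listing = [
    "quotes/2011-01-27/IBM.csv",
    "trades/2011-01-27/IBM.csv",
    "quotes/2011-01-27/MSFT.csv",
]
manifest = build_manifest(listing)
```

Each map call would then receive one manifest line and open both files itself, which is why the reads are not local to the data.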