I have data files arranged in folders named by date. The directory structure is:
- /data/2011/01/01
- /data/2011/01/02
and so on. Inside each directory there are around 50 files that I need to parse, and I am giving the input to Hadoop as /data/** /** /** so that it can parse all the files. My questions are:
- How can I ask Hadoop to order the input? I need to parse the files date by date.
- While parsing the files for a particular date, I need to preload a data structure that is associated with that date and stored in the same date directory.
Thanks, Ankush
- You can't order the input. In the worst case, if the cluster can run as many tasks at once as there are input files, all of the files will be processed in parallel at the same moment.
- Perhaps you can create a custom implementation of "FileInputFormat" that reads the required config file and does what you need?
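
To make the second suggestion a bit more concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It assumes each task can find out the path of the file it is currently reading (in Hadoop this usually comes from the input split, for example `FileSplit.getPath()`). The class name `DateDirCache`, its methods, and the loader function are all hypothetical names made up for illustration. The idea is to derive the date directory from the file path, load that date's data structure once, and reuse it for the other files of the same date.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Hadoop: caches one side-data object per date directory.
final class DateDirCache<T> {
    private final Map<String, T> cache = new HashMap<>();
    private final Function<String, T> loader;

    // loader is an assumption: it receives a date directory such as
    // /data/2011/01/01 and returns the data structure stored there.
    DateDirCache(Function<String, T> loader) {
        this.loader = loader;
    }

    // Returns the directory part of a file path, e.g. /data/2011/01/01.
    static String dateDirOf(String filePath) {
        int slash = filePath.lastIndexOf('/');
        if (slash <= 0) {
            throw new IllegalArgumentException("no parent directory: " + filePath);
        }
        return filePath.substring(0, slash);
    }

    // Builds a key such as 2011-01-01 from the last three directory levels.
    // Keys of this form sort chronologically when compared as strings.
    static String dateKeyOf(String filePath) {
        String[] parts = dateDirOf(filePath).split("/");
        int n = parts.length;
        if (n < 3) {
            throw new IllegalArgumentException("expected .../yyyy/mm/dd/file: " + filePath);
        }
        return parts[n - 3] + "-" + parts[n - 2] + "-" + parts[n - 1];
    }

    // Loads the side data for the file's date directory on first use only.
    T forFile(String filePath) {
        return cache.computeIfAbsent(dateDirOf(filePath), loader);
    }
}
```

A task would call `forFile(path)` before parsing each file. If the output has to end up in date order, one option is to emit `dateKeyOf(path)` as the map output key, since Hadoop sorts map output by key before the reduce phase, rather than relying on the order in which the input files are read.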