It seems like a very common use case, but it is hard to do in Hadoop (it is possible with a WholeFileRecordReader class). Is it possible at all in Dumbo or Pig? Does anyone know a way to process whole files as map tasks using Dumbo or Pig?
By WholeFileRecordReader, do you mean that the input file is not split? If so, set mapred.min.split.size to a very large value; both plain MapReduce and Pig will honor it.
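For example, the property can be passed on the command line so that each file ends up in a single split (a sketch; the jar, class, script, and path names are placeholders, and the size value is just "larger than any input file"):

```shell
# Ask for a minimum split size of 100 GB so no input file is ever split;
# each mapper then sees one whole file's worth of input.

# Plain MapReduce job:
hadoop jar myjob.jar MyJob -D mapred.min.split.size=107374182400 input/ output/

# Pig script:
pig -Dmapred.min.split.size=107374182400 myscript.pig
```

Note that this only prevents splitting; whether a mapper sees the file as one record still depends on the record reader in use.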
I am assuming you want to load each file as one record in Pig. If not, please be more specific in your question.
I don't know of a Pig storage loader that loads an entire file as a single record (in either the standard distribution or in piggybank). I suggest writing your own Pig custom loader, which is relatively easy.
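A minimal sketch of such a loader, assuming the Hadoop and Pig jars are on the classpath. The class names WholeFileLoader, WholeFileInputFormat, and WholeFileRecordReader are my own; LoadFunc, PigSplit, and the Hadoop interfaces are the real extension points. The idea is a non-splittable InputFormat whose record reader emits the whole file as one BytesWritable, wrapped by a LoadFunc that turns it into a single-field tuple:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Loads each input file as one tuple: (file contents as a bytearray).
public class WholeFileLoader extends LoadFunc {

    // InputFormat that never splits a file: one split, one record per file.
    public static class WholeFileInputFormat
            extends FileInputFormat<NullWritable, BytesWritable> {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new WholeFileRecordReader();
        }
    }

    // Reads the entire file into a single BytesWritable value.
    public static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }
        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) return false;
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FSDataInputStream in = file.getFileSystem(conf).open(file);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }
        @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }

    private RecordReader<NullWritable, BytesWritable> reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public InputFormat getInputFormat() {
        return new WholeFileInputFormat();
    }
    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }
    @Override
    @SuppressWarnings("unchecked")
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }
    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) return null;
            BytesWritable content = reader.getCurrentValue();
            Tuple t = tupleFactory.newTuple(1);
            t.set(0, new DataByteArray(content.getBytes(), 0, content.getLength()));
            return t;
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
```

From Pig Latin you would then use it as something like `files = LOAD 'input' USING WholeFileLoader();` (hypothetical script), with each tuple holding one whole file.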