Process entire files in Hadoop using Python code (preferably in Dumbo)

It seems like a very common use case, but it is surprisingly hard to do in Hadoop (it is possible with the WholeFileRecordReader class). Is it at all possible in Dumbo or Pig? Does anyone know a way to process whole files as map tasks using Dumbo or Pig?
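
For context, a common workaround when no whole-file input format is available is to make the job input a text file listing HDFS paths (one per line) and read each file from inside the mapper. Here is a minimal Dumbo sketch of that idea; the hadoop CLI call and the len(data) output are placeholders for real per-file logic, not a tested recipe:

    import subprocess

    def mapper(key, value):
        # Each input record is assumed to be a single HDFS path.
        path = value.strip()
        # Read the entire file through the hadoop CLI so the mapper
        # sees the complete contents without a custom RecordReader.
        proc = subprocess.Popen(["hadoop", "fs", "-cat", path],
                                stdout=subprocess.PIPE)
        data = proc.communicate()[0]
        # Placeholder logic: emit the path and the file size in bytes.
        yield path, len(data)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper)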


Doesn't WholeFileRecordReader just mean the input file is not split? If so, set mapred.min.split.size to a very large value; both plain MapReduce and Pig will honor it.
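
With Dumbo, for example, the property can be passed on the command line; as far as I know, dumbo's -hadoopconf flag forwards properties to the job configuration (the script name and paths here are made up):

    dumbo start myjob.py \
        -input /data/docs \
        -output /data/out \
        -hadoopconf mapred.min.split.size=10737418240

Keep in mind this only guarantees one map task per file; a streaming-style mapper still receives the file record by record, so it has to accumulate the records itself if it needs the whole file at once.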


I am assuming you want to have one file as one record in Pig. If not, please be more specific in your question.

I don't know of a Pig storage loader that loads an entire file at once as a single record (neither in the standard distribution nor in PiggyBank). I suggest you write your own custom Pig loader, which is relatively easy.
