File Processing with Elastic MapReduce - No Reducer Step?

I have a large set of text files in an S3 directory. For each text file, I want to apply a function (an executable loaded through bootstrapping) and then write the results to another text file with the same name in an output directory in S3. So there's no obvious reducer step in my MapReduce job.

I have tried using NONE as my reducer, but the output directory fills with files like part-00000, part-00001, etc. And there are more of these than there are files in my input directory; each part- file represents only a processed fragment.

Any advice is appreciated.


Hadoop provides a reducer called the Identity Reducer.

The Identity Reducer simply outputs whatever it takes in (it is the identity relation). This is what you want, and if you don't specify a reducer, Hadoop will use this reducer for your jobs automatically. The same is true for Hadoop streaming.
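For illustration, a minimal sketch assuming the newer org.apache.hadoop.mapreduce API: the base Mapper and Reducer classes are themselves identity operations, so wiring them in explicitly (as below) behaves the same as leaving them unset.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IdentityReduceJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "identity reduce");
        job.setJarByClass(IdentityReduceJob.class);
        // Both base classes pass every (key, value) pair through unchanged,
        // which is exactly what happens when no mapper/reducer is specified.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        // With the default TextInputFormat, the identity mapper emits
        // (LongWritable offset, Text line) pairs.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}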

I've never run a job that doesn't output the files as part-####. I did some research and found that you can do what you want by subclassing the OutputFormat class. You can see what I found here: http://wiki.apache.org/hadoop/FAQ#A27. Sorry I don't have an example.
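That said, here is one hedged sketch of the general idea, using the real MultipleOutputs class in a map-only job so that each record is written under the name of the file it came from (the class name NamedOutputMapper and the surrounding job wiring are my own assumptions, not from the FAQ):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class NamedOutputMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> out;
    private String baseName;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<>(context);
        // Derive the output name from the input file being processed.
        baseName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Writes to <output dir>/<input file name>-m-NNNNN rather than
        // the usual part-NNNNN.
        out.write(NullWritable.get(), value, baseName);
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        out.close();
    }
}

You would typically pair this with LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) in the driver so that empty part- files are not created alongside the named ones.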

To cite my sources, I learned most of this from Tom White's book: http://www.hadoopbook.com/.


From what I've read about Hadoop, it seems you need a reducer even if it doesn't change the mappers' output, just to merge the mappers' outputs.


You do not need to have a reducer. You can set the number of reducers to 0 in the job configuration stage, e.g.

job.setNumReduceTasks(0);
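For context, a minimal map-only driver might look like the following sketch (YourMapper is a hypothetical placeholder for your own mapper class, assumed to emit Text keys and values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map only");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(YourMapper.class); // hypothetical mapper
        // Zero reduce tasks: mapper output is written straight to the
        // output directory as part-m-NNNNN files; no sort/shuffle runs.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}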

Also, to ensure that each mapper processes one complete input file, you can tell Hadoop that the input files are not splittable. FileInputFormat has a method

protected boolean isSplitable(JobContext context, Path filename)

that can be used to mark a file as not splittable, which means it will be processed by a single mapper. See the FileInputFormat documentation for details. I just re-read your question, and realised that your input is probably a file with a list of filenames in it, so you most likely want it to be split, or it will only be run by one mapper.
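For the case where each file should go whole to one mapper, a small sketch of that override (the subclass name is my own; the spelling isSplitable, with one t, is Hadoop's):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: one mapper per input file
    }
}

You would then register it in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class).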

What I would do in your situation is have an input which is a list of file names in S3. The mapper's input is then a file name, which it downloads and runs your executable against. The output of this run is then uploaded to S3, and the mapper moves on to the next file. The mapper does not need to output anything, though it might be a good idea to output the name of each processed file so you can check against the input afterwards. Using this method, you would not need the isSplitable override; a sketch of the pattern follows below.
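A hedged sketch of that pattern: the job input is a text file listing S3 paths, one per line; each map() call copies one file locally, runs the bootstrapped executable on it, and uploads the result. The executable path and the s3n:// URIs are hypothetical placeholders, and error handling is kept minimal.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExecMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String input = value.toString().trim(); // e.g. s3n://in-bucket/file1.txt
        String name = new Path(input).getName();
        Configuration conf = context.getConfiguration();

        // Copy the file from S3 into the task's local working directory.
        FileSystem s3 = FileSystem.get(URI.create(input), conf);
        Path local = new Path("input-" + name);
        s3.copyToLocalFile(new Path(input), local);

        // Run the executable loaded through bootstrapping (path assumed).
        Path localOut = new Path("output-" + name);
        Process p = new ProcessBuilder("/home/hadoop/myexe",
                local.toString(), localOut.toString()).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("executable failed on " + input);
        }

        // Upload the result under the same name in the output location.
        String output = "s3n://out-bucket/results/" + name; // hypothetical
        FileSystem outFs = FileSystem.get(URI.create(output), conf);
        outFs.copyFromLocalFile(localOut, new Path(output));

        // Emit the processed name so the job output doubles as an audit log.
        context.write(new Text(name), new Text("done"));
    }
}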
