Not getting correct output when running standard "WordCount" program using Hadoop 0.20.2

I'm new to Hadoop. I have been trying to run the famous "WordCount" program -- which counts the total number of words in a list of files -- using Hadoop 0.20.2 on a single-node cluster.

Following is my program:

import java.io.File;
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    } 

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values, Context context) 
        throws IOException, InterruptedException {
            int sum = 0;
            while (values.hasNext()) {
                ++sum ;
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");        
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));        
        job.setJarByClass(WordCount.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);       

        job.setReducerClass(Reduce.class);          
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(5);        
        job.waitForCompletion(true);       

    }

}

Suppose the input file is A.txt with the following contents:

A B C D A B C D

When I run this program using hadoop-0.20.2 (commands omitted for clarity), the output is:

A 1 A 1 B 1 B 1 C 1 C 1 D 1 D 1

which is wrong. The expected output is:

A 2 B 2 C 2 D 2

This "WordCount" program is a pretty standard program, and I'm not sure what is wrong with the code. I have written the configuration files (mapred-site.xml, core-site.xml, etc.) correctly.

How can I fix this problem?


This code actually runs a local MapReduce job. If you want to submit it to a real cluster, you have to provide the fs.default.name and mapred.job.tracker configuration parameters; these keys map to your machine as host:port pairs, just as in your core-site.xml and mapred-site.xml.
Also make sure your data is in HDFS rather than on local disk, and reduce the number of reducers: with this input there are only about two records per reducer, so you should set it to 1. A sketch of these settings is shown below.
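
As an illustration only -- the host names and ports below (namenode:9000, jobtracker:9001) are placeholders and must match whatever your own core-site.xml and mapred-site.xml declare -- the driver could be wired to the cluster and to a single reducer like this:

Configuration conf = new Configuration();
// Same keys as in core-site.xml / mapred-site.xml (Hadoop 0.20.x property names).
conf.set("fs.default.name", "hdfs://namenode:9000");   // placeholder NameNode address
conf.set("mapred.job.tracker", "jobtracker:9001");     // placeholder JobTracker address

Job job = new Job(conf, "wordcount");
// ... same input/output paths, formats, mapper and reducer setup as in the question ...
job.setNumReduceTasks(1);   // tiny input: a single reducer is enough
job.waitForCompletion(true);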


The reduce signature is incorrect: the second parameter must be of type Iterable&lt;IntWritable&gt;, not Iterator&lt;IntWritable&gt;. Because the signature does not match, your method never overrides Reducer.reduce(), so the default (identity) implementation runs and writes every (key, 1) pair straight through -- which is exactly the output you are seeing. See the corrected sketch below.

http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapreduce/Reducer.html
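
Once the parameter type is Iterable, the loop also has to actually consume the values (the original while loop never calls next(), so even with a matching signature it would spin forever). A corrected reducer would look roughly like this; adding @Override makes the compiler catch such mismatches:

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            // Sum the counts emitted by the mappers for this key.
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }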

See also Using Hadoop for the First Time, MapReduce Job does not run Reduce Phase
