KH_AJU - 2 months ago
Java Question

Hadoop WordCount Combiner

In the word count example, the reduce function is used both as the combiner and as the reducer.

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);  // write the total once per key, after the loop
    }
}

I understand the way the reducer works, but in the case of the combiner, suppose my input is

<Java,1> <Virtual,1> <Machine,1> <Java,1>

Does the combiner consider the first kv-pair and emit the same output, since it has only one value? How can it consider both <Java,1> pairs and combine them into <Java,2>, if we are considering one kv-pair at a time?

I know this is a false assumption; could someone please correct me on this?


The IntSumReducer class extends the Reducer class, and it is the Reducer class that does the magic here. If we look into the documentation:

"Reduces a set of intermediate values which share a key to a smaller set of values. Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.

Reducer has 3 primary phases:

Shuffle: The Reducer copies the sorted output from each Mapper using HTTP across the network.

Sort: The framework merge-sorts Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.

Reduce: In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs."

So by the time reduce (or combine) is called, the values for a given key have already been grouped together: the function never sees one kv-pair at a time, but each key together with all of its values.
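The grouping performed by the sort phase can be sketched in plain Java (no Hadoop required; the class and method names here are illustrative, not part of the Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerGrouping {

    // Simulates what the framework does before calling reduce()/combine():
    // it merge-sorts the map output by key, so each key arrives exactly once
    // together with ALL of its values, which are then summed.
    static Map<String, Integer> combine(String[][] mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] kv : mapOutput) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(kv[1]));
        }
        // The equivalent of IntSumReducer.reduce(): sum each key's value list.
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            combined.put(e.getKey(), sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // The map output from the question: <Java,1> <Virtual,1> <Machine,1> <Java,1>
        String[][] mapOutput = {
            {"Java", "1"}, {"Virtual", "1"}, {"Machine", "1"}, {"Java", "1"}
        };
        System.out.println(combine(mapOutput)); // {Java=2, Machine=1, Virtual=1}
    }
}
```

So the combiner does not see <Java,1> twice; after grouping it sees <Java, [1, 1]> once and emits <Java,2>.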

The program registers the same class for both the combine and the reduce operations:
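In the standard WordCount driver this is two calls on the Job object (a fragment of the driver configuration, not runnable on its own; `conf` and `TokenizerMapper` come from that example):

```java
// Driver configuration from the standard WordCount example:
// the same IntSumReducer class is registered in both roles.
Job job = Job.getInstance(conf, "word count");
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class); // combine on the map side
job.setReducerClass(IntSumReducer.class);  // reduce after shuffle/sort
```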


So what I figured out is: if we are using only one data node, we don't necessarily need to set a combiner class for this wordcount program, since the reducer class itself takes care of the combiner's job.


The method above has the same effect for the wordcount program if you are using only one data node.