user1157751 user1157751 - 20 days ago 4x
Python Question

MapReduce (Python) - How to sort reducer output for Top-N list?

I'm pretty new to MapReduce. Currently trying to complete the udacity course on Hadoop MapReduce.

I have a mapper that parses a forum nodes, and I will get the tags associated with each node. My objective is to sort the top 10 tags.

An example output:

video 1
cs101 1
meta 1
bug 1
issues 1
nationalities 1
cs101 1
welcome 1
cs101 1
cs212 1
cs262 1
cs253 1
discussion 1
meta 1

It is pretty easy to add them all up in reducer:


import sys
import string

total = 0
oldKey = None

for line in sys.stdin:
data_mapped = line.strip().split("\t")

if(len(data_mapped) != 2):
print "====================="
print line.strip()
print "====================="

key, value = data_mapped

if oldKey and oldKey != key:
print total, "\t", oldKey
oldKey = key;
total = 0

oldKey = key
total += float(value)

if oldKey != None:
print total, "\t", oldKey


1.0 application
1.0 board
1.0 browsers
1.0 bug
8.0 cs101
1.0 cs212
5.0 cs253
1.0 cs262
1.0 deadlines
1.0 digital
5.0 discussion
1.0 google-appengine
2.0 homework
1.0 html
1.0 hungarian
1.0 hw2-1
3.0 issues
2.0 jobs
2.0 lessons

I know that the keys are sorted in the output of a mapper, hence I just test if the keys are still the same tag. If not, then I'll output the # of times that a tag appears.

However, the problem is how do I sort this list?

I can sort the list in python if I store all the information in a list or a dictionary. However, it seems like a bad design decision, because if you have a lot of different tags, you will go out of memory.


I believe you can use the collections.Counter class here:

Example: (modified from your code)


import sys
import collections

counter = collections.Counter()

for line in sys.stdin:
    k, v = line.strip().split("\t", 2)

    counter[k] += int(v)

print counter.most_common(10)

The collections.Counter() class implements this exact use-case and many other common use-cases around counting things and collecting various stats, etc.

8.3.1. Counter objects A counter tool is provided to support convenient and rapid tallies. For example: