I'm building a movie recommendation system using Hadoop/MapReduce.
For now I'm using only Python to implement the MapReduce process: I run each mapper and reducer separately and feed the console output of the mapper into the reducer.
The issue I'm having is that Python prints values to the terminal as strings, so when I'm working with numbers, the numbers come out as strings. That makes it hard to streamline the process, since converting them back adds more load on the server.
How do I resolve this? I'm looking to implement it in pure Python with no third-party libraries.
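For reference, I chain the stages locally with a pipe like the following (ratings.csv, mapper.py, and reducer.py are just the names I use here; the scripts call the functions below, and sort stands in for Hadoop's shuffle phase):

cat ratings.csv | python mapper.py | sort | python reducer.py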
import sys

def mapper():
    '''
    From Mapper1: we only need UserID, (MovieID, rating)
    as output.
    '''
    # First mapper: read input lines from stdin
    for line in sys.stdin:
        # Strip whitespace and split on the ',' delimiter
        data = line.strip().split(',')
        if len(data) == 4:
            # Print a formatted string rather than the values
            # directly; printing comma-separated values makes
            # Python interpret them as a tuple
            userid, movieid, rating, timestamp = data
            print "{0},({1},{2})".format(userid, movieid, rating)
def reducer():
    oldKey = None
    rating_arr = []
    for line in sys.stdin:
        # Each line is: user,(movie,rating)
        # We need to group the tuples per unique user, so we
        # append them to a list. Since there are two fields,
        # split only at the first occurrence of ','
        data = line.strip().split(',', 1)
        # Check for exactly 2 data values
        if len(data) != 2:
            continue
        x, y = data
        if oldKey and oldKey != x:
            # Key changed: emit the previous user's grouped ratings
            print "{0},{1}".format(oldKey, rating_arr)
            rating_arr = []
        oldKey = x
        rating_arr.append(y)
    # Emit the final group
    if oldKey is not None:
        print "{0},{1}".format(oldKey, rating_arr)
Given this input to the reducer:

671,(4973,4.5)
671,(4993,5.0)
670,(4995,4.0)

the output is:

671,['(4973,4.5)', '(4993,5.0)']
670,['(4995,4.0)']
data is a string, so when you split it and assign the second piece to y, y is still a string.
If you want the raw values of the tuple, as numbers, you need to parse them; ast.literal_eval can help.
For example,
In [1]: line = """671,(4973,4.5)"""
In [2]: data = line.strip().split(',',1)
In [3]: data
Out[3]: ['671', '(4973,4.5)']
In [4]: x , y = data
In [5]: type(y)
Out[5]: str
In [6]: import ast
In [7]: y = ast.literal_eval(y)
In [8]: y
Out[8]: (4973, 4.5)
In [9]: type(y)
Out[9]: tuple
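Applied to your reducer, a minimal sketch (keeping your grouping logic, and assuming the user,(movie,rating) line format your mapper emits) parses each tuple as it arrives, so the grouped list holds real (int, float) tuples instead of strings:

import ast
import sys

def reducer():
    oldKey = None
    rating_arr = []
    for line in sys.stdin:
        data = line.strip().split(',', 1)
        if len(data) != 2:
            continue
        x, y = data
        if oldKey and oldKey != x:
            print "{0},{1}".format(oldKey, rating_arr)
            rating_arr = []
        oldKey = x
        # "(4973,4.5)" is parsed into the tuple (4973, 4.5)
        rating_arr.append(ast.literal_eval(y))
    if oldKey is not None:
        print "{0},{1}".format(oldKey, rating_arr)

With the sample input above, this prints 671,[(4973, 4.5), (4993, 5.0)] instead of a list of strings.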
Now, if you would like to switch to PySpark, you would have more control over the variable/object types, rather than everything being passed around as strings as it is with Hadoop Streaming.
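A minimal sketch of the same grouping in PySpark (the file name ratings.csv and the local master are assumptions for illustration):

from pyspark import SparkContext

sc = SparkContext("local", "movie-recommender")

# Each record stays a real (int, (int, float)) pair end to end
grouped = (sc.textFile("ratings.csv")
             .map(lambda line: line.strip().split(','))
             .filter(lambda parts: len(parts) == 4)
             .map(lambda p: (int(p[0]), (int(p[1]), float(p[2]))))
             .groupByKey()
             .mapValues(list))

for user, ratings in grouped.collect():
    print user, ratings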