Padam Sethia - 3 years ago
Python Question

Controlling python outputs to console

I'm building a movie recommendation system using Hadoop/MapReduce.
For now I'm using only Python to implement the MapReduce process.

Basically, I run each mapper and reducer separately and feed the mapper's console output to the reducer.

The issue I'm having is that Python writes values to the terminal as strings, so when I'm working with numbers they arrive as strings too. That makes it hard to simplify the process, because converting them back adds more load on the server.

How do I resolve this? I'm looking to implement it in pure Python, with no third-party libraries.

import sys

def mapper():
    '''
    From Mapper1: we need only UserID, (MovieID, rating)
    as output.
    '''

    # First mapper

    # Read the input line by line
    for line in sys.stdin:
        # print(line)  # debug only: this would also be fed to the reducer
        # Strip whitespace and split on the delimiter ','
        data = line.strip().split(',')

        if len(data) == 4:
            # Printing comma-separated values directly makes Python
            # treat them as a tuple, so format a single string instead
            userid, movieid, rating, timestamp = data
            print("{0},({1},{2})".format(userid, movieid, rating))


Here's the reducer:

def reducer():

    oldKey = None
    rating_arr = []

    for line in sys.stdin:
        # We receive lines of the form: user,(movie,rating)
        # To group the tuples per unique user, we append them to a list.
        # Each line holds two fields, so split only at the
        # first occurrence of ','

        data = line.strip().split(',', 1)
        # Skip lines that do not have exactly 2 fields
        if len(data) != 2:
            continue

        x, y = data

        if oldKey and oldKey != x:
            # The key changed: emit the finished group and start a new one
            print("{0},{1}".format(oldKey, rating_arr))
            rating_arr = []
        oldKey = x
        rating_arr.append(y)

    # Emit the last group
    if oldKey is not None:
        print("{0},{1}".format(oldKey, rating_arr))


The input is:

671,(4973,4.5)\n671,(4993,5.0)\n670,(4995,4.0)


The output is:

671,['(4973,4.5)', '(4993,5.0)']
670,['(4995,4.0)']


I need the tuples as they are, not as strings.

Thanks!

Answer Source

Because data comes from splitting a string, its pieces are strings too, so y is still a string after you assign it.

If you want the raw values of the tuple, as numbers, you need to parse them.

ast.literal_eval can help.

For example,

In [1]: line = """671,(4973,4.5)"""

In [2]:  data = line.strip().split(',',1)

In [3]: data
Out[3]: ['671', '(4973,4.5)']

In [4]: x , y = data

In [5]: type(y)
Out[5]: str

In [6]: import ast

In [7]: y = ast.literal_eval(y)

In [8]: y
Out[8]: (4973, 4.5)

In [9]: type(y)
Out[9]: tuple
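
Applied to the reducer from the question, the same idea might look like the sketch below. It assumes the mapper's `user,(movie,rating)` line format and that lines for the same user arrive next to each other. `group_ratings` is a hypothetical helper name, not part of the original code, and malformed tuples are not handled here.

```python
import ast


def group_ratings(lines):
    """Yield (user, [(movie, rating), ...]) for consecutive lines sharing a user.

    Assumes each line looks like 'user,(movie,rating)' and that lines
    for the same user are adjacent.
    """
    old_key = None
    ratings = []
    for line in lines:
        # Split only at the first comma: 'user' and '(movie,rating)'
        data = line.strip().split(',', 1)
        if len(data) != 2:
            continue
        user, raw = data
        if old_key is not None and old_key != user:
            # The key changed: hand back the finished group
            yield old_key, ratings
            ratings = []
        old_key = user
        # literal_eval turns the text '(4973,4.5)' into the tuple (4973, 4.5)
        ratings.append(ast.literal_eval(raw))
    # Hand back the last group, if any input was seen
    if old_key is not None:
        yield old_key, ratings


# Sample input taken from the question
sample = ["671,(4973,4.5)\n", "671,(4993,5.0)\n", "670,(4995,4.0)\n"]
for user, ratings in group_ratings(sample):
    print("{0},{1}".format(user, ratings))
```

With the sample input this prints `671,[(4973, 4.5), (4993, 5.0)]` and `670,[(4995, 4.0)]`, so the list now holds real tuples of numbers. In a streaming script you would pass `sys.stdin` instead of `sample`. Anything written to stdout is text again, so a later stage reading this output would have to parse it the same way.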

If you are willing to switch to PySpark, you would have more control over variable and object types, rather than everything being a string as it is with Hadoop Streaming.
