plainter - 24 days ago
Linux Question

MapReduce in Python: os.environ["map_input_file"] doesn't work in map.py

This is my first time studying Hadoop MapReduce with Python.

I wrote a map.py that reads the name of each input file, as a first step toward learning how to join two files.
Here are the two CSV files:

worksheet1.csv

sno,name
1,name1
2,name2
3,name3
4,name4


worksheet2.csv

sno,courseno,grade
1,1,80
1,2,90
2,1,82
2,2,95


map.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys

def mapper():
    filepath = os.environ["map_input_file"]
    filename = os.path.split(filepath)[-1]  # get the file name
    for line in sys.stdin:
        if line.strip() == "":
            continue
        fields = line[:-1].split("\t")
        sno = fields[0]  # get student ID

        if filename == 'worksheet1':
            # get student ID and name, mark 0
            name = fields[1]
            print '\t'.join((sno, '0', name))
        elif filename == 'worksheet2':
            # get student ID, course number, grade, mark 1
            courseno = fields[1]
            grade = fields[2]
            print '\t'.join((sno, '1', courseno, grade))


if __name__ == '__main__':
    mapper()


Then I ran

$ cat worksheet1 worksheet2 | python map.py


to test the program.

It fails with the following error:

Traceback (most recent call last):
  File "map.py", line 30, in <module>
    mapper()
  File "map.py", line 11, in mapper
    filepath = os.environ['map_input_file']
  File "/usr/lib64/python2.7/UserDict.py", line 23, in __getitem__
    raise KeyError(key)
KeyError: 'map_input_file'


Please tell me why this happens and how to fix the code.
Thank you very much!

Answer

You haven't set the map_input_file environment variable. Hadoop Streaming defines it for each map task, but a plain shell run does not, so os.environ raises KeyError. Also, you're piping both data files into the script through cat, so they arrive as one stream on sys.stdin and your code has no way to tell which file the current line came from. I suggest using the fileinput module instead.
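As a rough sketch of the fileinput suggestion, the mapper below takes the files as command-line arguments instead of a cat pipe and asks fileinput.filename() which file each line came from. Several things here are assumptions rather than part of the original script: the helper name tag_line, the out parameter on mapper, splitting on commas (to match the sample CSV data), skipping the header row, and accepting the file names with or without the .csv extension, since the question uses both spellings.

```python
import fileinput
import os
import sys


def tag_line(filename, line):
    """Return a tab-separated, tagged record for one CSV line, or None to skip it."""
    fields = line.strip().split(",")  # assumption: comma-separated, as in the samples
    if not fields[0] or fields[0] == "sno":  # skip blank lines and the header row
        return None
    # Drop the directory and extension so "worksheet1" and "worksheet1.csv" both match.
    base = os.path.splitext(os.path.basename(filename))[0]
    if base == "worksheet1" and len(fields) >= 2:
        # student ID, mark 0, name
        return "\t".join((fields[0], "0", fields[1]))
    if base == "worksheet2" and len(fields) >= 3:
        # student ID, mark 1, course number, grade
        return "\t".join((fields[0], "1", fields[1], fields[2]))
    return None


def mapper(paths, out=sys.stdout):
    # fileinput.input() iterates over the lines of every file in `paths`;
    # fileinput.filename() names the file the most recent line came from.
    for line in fileinput.input(paths):
        record = tag_line(fileinput.filename(), line)
        if record is not None:
            out.write(record + "\n")


# Only runs when file paths are given on the command line.
if __name__ == "__main__" and sys.argv[1:]:
    mapper(sys.argv[1:])
```

Locally this would be run as `python map.py worksheet1.csv worksheet2.csv`. It is only a local-testing approach: under Hadoop Streaming the input still arrives on stdin, so there the environment variable lookup remains the way to learn the file name (newer Hadoop releases are understood to expose it as mapreduce_map_input_file, which is worth checking against your version).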