mezz mezz - 1 month ago 7
Python Question

Importing text file : No Columns to parse from file

I am trying to read input from sys.stdin. This is a MapReduce program for Hadoop. The input file is in txt form. Preview of the data set:

196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
286 1014 5 879781125
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
291 1042 4 874834944


Code that I have been trying -

import sys
import pandas as pd

df = pd.read_csv(sys.stdin, error_bad_lines=False)


I have also tried delimiter='\t', header=False, and defining the column names.

Nothing seems to work. This is the error I am getting:

[root@sandbox lab]# cat /root/lab/u.data | python /root/lab/mid-1-mapper.py |python /root/lab/mid-1-reducer.py
Traceback (most recent call last):
  File "/root/lab/mid-1-reducer.py", line 8, in <module>
    df = pd.read_csv(sys.stdin,delimiter='\t')
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 388, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 729, in __init__
    self._make_engine(self.engine)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 922, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1389, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 538, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5896)
pandas.io.common.EmptyDataError: No columns to parse from file


However, when I try this directly in Python (not in Hadoop), it works fine.

I have looked through Stack Overflow posts, and one of them suggested wrapping the call in try and except. Applying that leaves me with an empty file.
Can anybody help? Thanks

Answer

Using try and except just lets you continue in spite of errors and handle them. It won't magically fix your errors.

By default read_csv expects comma-separated values, which your input is not. A quick look at the documentation:

delim_whitespace : boolean, default False

Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.

This seems like the right argument. Use

pandas.read_csv(filepath_or_buffer, delim_whitespace=True).

Using delimiter='\t' should also work, unless the tabs have been expanded (replaced by spaces). Since we can't tell from the preview which is the case, delim_whitespace seems to be the safer option.
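
As a rough sketch of the suggestion above, the snippet below parses a few rows from the question's preview with sep=r"\s+", the setting the quoted documentation describes as equivalent to delim_whitespace=True. It is written for Python 3. The column names and the parse_ratings helper are assumptions made for illustration; the question does not name the columns.

```python
import io

import pandas as pd

# A few rows copied from the question's preview, tab-separated here.
SAMPLE = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"

# Column names are assumed for illustration; the source does not name them.
COLUMNS = ["user_id", "item_id", "rating", "timestamp"]


def parse_ratings(buffer):
    """Parse whitespace-delimited rows that have no header line."""
    return pd.read_csv(
        buffer,
        sep=r"\s+",      # any run of spaces or tabs separates fields
        header=None,     # the first line is data, not column names
        names=COLUMNS,
    )


df = parse_ratings(io.StringIO(SAMPLE))
print(df)
```

Because the separator matches any run of whitespace, the same call should give the same frame whether the fields are separated by tabs or by spaces.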

If this doesn't help, print out what you read from sys.stdin to check that the text is actually being passed to your script.
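
One way to do that check, sketched with the standard library only: read the whole stream once, report on stderr how many lines arrived, and keep the text for parsing afterwards. The read_all helper is a hypothetical name, and the two StringIO streams stand in for sys.stdin so the example can run on its own.

```python
import io
import sys


def read_all(stream):
    """Read the whole stream and report on stderr what arrived."""
    text = stream.read()
    line_count = len(text.splitlines())
    # stderr keeps diagnostics out of the data that flows down the pipe
    sys.stderr.write("received %d line(s)\n" % line_count)
    if not text.strip():
        sys.stderr.write("input is empty; the upstream command printed nothing\n")
    return text


# Stand-ins for sys.stdin: one empty stream, one with a single record.
empty = read_all(io.StringIO(""))
full = read_all(io.StringIO("196\t242\t3\t881250949\n"))
```

In the real script you would call read_all(sys.stdin) and, if the text is not empty, wrap it in io.StringIO before handing it to read_csv. An empty result here would explain the "No columns to parse from file" error.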

Edit: I just saw that you use

cat /root/lab/u.data | python /root/lab/mid-1-mapper.py |python /root/lab/mid-1-reducer.py

Is this intended? This way mid-1-reducer.py processes the output of mid-1-mapper.py, not the file itself, so if the mapper prints nothing the reducer receives empty input. If you want to process the content of the file u.data, consider reading the file and not sys.stdin.
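
A minimal sketch of that idea, assuming a hypothetical command line where an optional first argument names the data file: the script reads that file when a path is given and falls back to sys.stdin otherwise. The open_input and main names are illustrative, and main only counts lines to keep the example free of third-party imports.

```python
import sys


def open_input(path=None):
    """Return a readable text stream: the named file if given, else stdin."""
    if path:
        return open(path, "r")
    return sys.stdin


def main(argv):
    # argv[1], when present, is treated as the data file path (hypothetical CLI)
    path = argv[1] if len(argv) > 1 else None
    stream = open_input(path)
    try:
        # Count the lines; a real script would parse the stream here instead.
        return sum(1 for _ in stream)
    finally:
        if stream is not sys.stdin:
            stream.close()
```

The stream returned by open_input can be passed to read_csv in place of sys.stdin, so the same script works both when given a path such as /root/lab/u.data and when fed through a pipe.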
