
Importing text file: No columns to parse from file

I am trying to take input from sys.stdin. This is a MapReduce program for Hadoop, and the input file is plain text. Preview of the data set:

196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
286 1014 5 879781125
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
291 1042 4 874834944


The code I have been trying:

import sys
import pandas as pd

df = pd.read_csv(sys.stdin, error_bad_lines=False)


I have also tried delimiter='\t', header=False, and defining the column names.

Nothing seems to work; the error I am getting is:

[root@sandbox lab]# cat /root/lab/u.data | python /root/lab/mid-1-mapper.py |python /root/lab/mid-1-reducer.py
Traceback (most recent call last):
File "/root/lab/mid-1-reducer.py", line 8, in <module>
df = pd.read_csv(sys.stdin,delimiter='\t')
File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 645, in parser_f
return _read(filepath_or_buffer, kwds)
File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 388, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 729, in __init__
self._make_engine(self.engine)
File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 922, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1389, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 538, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5896)
pandas.io.common.EmptyDataError: No columns to parse from file


However, when I run this directly in Python (not through the Hadoop pipeline), it works fine.

I have looked into Stack Overflow posts; one of them suggested using try and except. Applying that leaves me with an empty file.
Can anybody help? Thanks

Answer Source

Using try and except just lets you continue in spite of errors and handle them; it won't magically fix your errors.

read_csv expects CSV files, which your input obviously is not. A quick look at the documentation:

delim_whitespace : boolean, default False

Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.

This seems like the right argument. Use

pandas.read_csv(filepath_or_buffer, delim_whitespace=True).
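
A minimal sketch of what the read in your reducer could look like; the column names are my own guesses from the four-column preview (user, item, rating, timestamp), so adjust them to your data:

import sys
import pandas as pd

# Sketch: read whitespace-delimited rows from stdin.
# The column names below are assumptions based on the preview data.
df = pd.read_csv(
    sys.stdin,
    delim_whitespace=True,   # split on any run of spaces or tabs (like sep='\s+')
    header=None,             # the stream has no header row
    names=['user_id', 'item_id', 'rating', 'timestamp'],
)
print(df.head())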

Using delimiter='\t' should also work, unless the tabs were expanded (replaced by spaces). Since we can't really tell, delim_whitespace seems to be the safer option.

If this doesn't help, print out what arrives on sys.stdin to check that the text is actually being passed through.
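
A quick way to do that is to dump the first few raw lines to stderr, so you can see exactly what the reducer receives:

import sys

# Diagnostic sketch: repr() makes tabs, spaces and empty lines visible.
for i, line in enumerate(sys.stdin):
    sys.stderr.write(repr(line) + '\n')
    if i >= 4:
        break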

Edit: I just saw that you use

cat /root/lab/u.data | python /root/lab/mid-1-mapper.py |python /root/lab/mid-1-reducer.py

Is this intended? This way mid-1-reducer.py processes the output of mid-1-mapper.py, not the raw file. If you want to process the content of the file u.data directly, consider reading the file rather than sys.stdin.
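
In that case the read would take the file path instead of the stream, for example (same assumed column names as above):

import pandas as pd

# Sketch: read u.data directly; the path is taken from your command line.
df = pd.read_csv('/root/lab/u.data', delim_whitespace=True, header=None,
                 names=['user_id', 'item_id', 'rating', 'timestamp'])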