sohil sohil - 29 days ago 12
Python Question

Pandas parser CParseError

I am using pandas to read a CSV file and I am getting this error:

  File "antifraud.py", line 11, in <module>
    df = pd.read_csv(trainFilePath, names=['time', 'id1', 'id2', 'amount', 'message'])
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 470, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 256, in _read
    return parser.read()
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 715, in read
    ret = self._engine.read(nrows)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 1164, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 758, in pandas.parser.TextReader.read (pandas/parser.c:7411)
  File "pandas/parser.pyx", line 780, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7651)
  File "pandas/parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas/parser.c:8268)
  File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
  File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
pandas.parser.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.


When I tried reading the file with the csv module instead:

import csv

with open(filepath, 'r') as f:
    reader = csv.reader(f)
    linenumber = 1
    try:
        for row in reader:
            linenumber += 1
    except Exception as e:
        # e.message does not exist in Python 3, so format the exception itself
        print("Error line %d: %s %s" % (linenumber, type(e), e))


I saw that the error was at a particular line. The line is:

2016-11-02 09:45:43, 10244, 26248, 20.06, 提供一天 我充滿


My question: is this caused by escape characters such as '\r' or '\n' in the data, by pandas being unable to read Chinese because I haven't specified an encoding, or by something else?

Answer

This is a reported pandas issue. As a workaround, you can try the following, suggested by @chris-b1:

pd.read_csv(open(trainFilePath,'rU'), encoding='utf-8')
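To check whether stray line terminators really are the cause, it can help to inspect the raw bytes before involving pandas at all. The following is only a rough sketch using the standard library: `count_line_endings` and `read_rows` are hypothetical helper names, and the sample bytes are made up for illustration, not taken from the asker's file. Note that in Python 3 a text-mode `open()` already applies universal newlines by default, and the `'U'` flag was removed in Python 3.11, so on current interpreters `newline=None` plays the role of `'rU'`.

```python
import csv
import io


def count_line_endings(raw: bytes) -> dict:
    """Count CRLF, bare LF and bare CR terminators in raw file bytes."""
    crlf = raw.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": raw.count(b"\n") - crlf,  # LF not preceded by CR
        "cr": raw.count(b"\r") - crlf,  # CR not followed by LF
    }


def read_rows(raw: bytes, encoding: str = "utf-8") -> list:
    """Decode with universal newlines (newline=None) and parse as CSV.

    With newline=None, '\r', '\n' and '\r\n' are all translated to '\n'
    before the csv module sees the text.
    """
    text = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline=None)
    return list(csv.reader(text, skipinitialspace=True))


if __name__ == "__main__":
    # Made-up sample: two rows, the first one terminated by a bare '\r'.
    sample = (
        "2016-11-02 09:45:43, 10244, 26248, 20.06, 提供一天\r"
        "2016-11-02 09:45:44, 10245, 26249, 5.00, hello\n"
    ).encode("utf-8")
    print(count_line_endings(sample))
    for row in read_rows(sample):
        print(row)
```

If the count of bare `'\r'` characters is non-zero, mixed line endings are a plausible culprit, and a file object opened with `open(trainFilePath, encoding='utf-8')` (universal newlines by default) could be passed to `pd.read_csv` in the same way as in the workaround above. If the count is zero, the encoding or malformed quoting would be the next thing to check.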