Valilutzik - 1 month ago
Python Question

Confusing read_table error in pandas

I am trying to read the seeds dataset using pandas. When loading the file using:

df = pd.read_table("seeds_dataset.txt", header=None)


I get:

CParserError: Error tokenizing data. C error: Expected 8 fields in line 8, saw 10


To load the file in Excel, I had to specify both tab and space as delimiters at the same time so that line 8 was read correctly, which (as far as I know) can't be done with pandas. Sublime Text displays the file correctly as-is.

I don't want to skip the bad lines with error_bad_lines, as there is nothing wrong with them. I also tried lineterminator, with no success.

Answer

Try the delim_whitespace option:

df = pd.read_table("seeds_dataset.txt", header=None, delim_whitespace=True)

EDIT: more detailed explanation:

The full signature for read_table is in the pandas documentation. It has all sorts of options, one of which is sep. This defines the delimiter between fields, and its default is '\t' (tab). One solution is to change the sep argument. The Python implementation of the pandas parser lets you use regex delimiters, so sep="\\s+" would delimit on any run of whitespace. The C parser (which, judging by the error message, is the one you're using) doesn't support general regex delimiters, but it does special-case "\s+", and it has the equivalent delim_whitespace option, which fits your needs exactly!
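To see the difference without the real file, here is a minimal sketch on a made-up three-line sample (the values and layout are my assumption, not the actual seeds data). It uses the sep=r"\s+" spelling, which behaves like delim_whitespace=True; newer pandas releases deprecate delim_whitespace in favor of it.

```python
import io

import pandas as pd

# Made-up sample (NOT the real seeds file): fields are separated by a
# single tab, a doubled tab, and a run of spaces.
sample = (
    "15.26\t14.84\t0.871\n"
    "14.88\t\t14.57\t0.8811\n"
    "14.29   14.09\t0.905\n"
)

# With the default tab delimiter, the doubled tab on the second line
# produces an extra empty field, so the parser reports a field-count mismatch.
try:
    pd.read_table(io.StringIO(sample), header=None)
except pd.errors.ParserError as exc:
    print("tab-only parse failed:", exc)

# sep=r"\s+" treats any run of spaces and/or tabs as one delimiter.
df = pd.read_table(io.StringIO(sample), sep=r"\s+", header=None)
print(df.shape)
print(df)
```

With the real file you would pass "seeds_dataset.txt" instead of the StringIO object; everything else stays the same.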