Roman Roman - 1 year ago 52
Python Question

How to make separator in read_csv more flexible wrt whitespace?

I need to created a data frame using data stored in a file. For that I want to use

read_csv
method. However, the separator is not very regular. Some columns are separated by tabs (
\t
), other are separated by spaces. Moreover, some columns can be separated by 2 or 3 or more spaces or even by a combination of spaces and tabs (for example 3 spaces, two tabs and then 1 space).

Is there a way to tell pandas to treat these files properly?

By the way, I do not have this problem if I use Python. I use:

for line in file(file_name):
fld = line.split()


And it works perfect. It does not care if there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?

DSM DSM
Answer Source

From the documentation, you can use either a regex or delim_whitespace:

>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print repr(line)
...     
'a\t  b\tc 1 2\n'
'd\t  e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4