N1B4 N1B4 - 7 months ago 125
Python Question

Pandas read csv with multiple whitespaces and parse dates

I have a csv file that looks like

Year Mo Da (01,52)
1950 1 1 0.00
1950 1 2 0.00
1950 1 3 0.05
1950 1 4 0.00
1950 1 5 0.07
1950 1 6 0.07


and I would like transform it into a dataframe with 2 columns: a datetime column of YYYYMMDD (using the "Year", "Mo", and "Da" columns in the raw data) and then the rainfall at the grid point (e.g. 01, 52) as the second column.

A desired output would be:

Datetime Rainfall
19500101 0.00
19500102 0.00
19500103 0.05


I am stuck on two issues: appropriately accounting for the whitespace during the read-in and properly using
parse_dates
.

The simple read-in command:

df = pd.read_csv(csv_fl)


Almost correctly reads in the headers, but splits the (01,52) into separate columns, yielding a trailing NaN, which shouldn't be there.

Year Mo Da (01 52)
0 1950 1 1 0.00 NaN


And trying to parse the dates using

df = pd.read_csv(csv_fl, parse_dates={'Datetime':[0,1,2]}, index_col=0)


leads to an IndexError

colnames.append(str(columns[c]))
IndexError: list index out of range


Any guidance is much appreciated.

Answer

If you pass params delim_whitespace=True and pass the 3 columns in a list to parse_dates the last step is just to overwrite the column names:

In [96]:
import pandas as pd
import io
t="""Year Mo Da (01,52) 
1950  1    1    0.00    
1950  1    2    0.00    
1950  1    3    0.05    
1950  1    4    0.00    
1950  1    5    0.07    
1950  1    6    0.07"""
df =pd.read_csv(io.StringIO(t), delim_whitespace=True, parse_dates=[['Year','Mo','Da']])
df.columns = ['Datetime', 'Rainfall']
df

Out[96]:
    Datetime  Rainfall
0 1950-01-01      0.00
1 1950-01-02      0.00
2 1950-01-03      0.05
3 1950-01-04      0.00
4 1950-01-05      0.07
5 1950-01-06      0.07

So I expect: df = pd.read_csv(csv_fl, delim_whitespace=True, parse_dates=[['Year','Mo','Da']])

should work followed by overwriting the column names