Tim Tim - 2 months ago 27
Python Question

Load CSV Strings With Different Types into Pandas Dataframe, Split Columns, Parse Date

I have two questions concerning a large csv file which contains data in the following way formatted as strings:

"XAU=,XAU=,XAG=,XAG="
"25/08/2014 6:00:05,1200.343,25/08/2014 6:00:03,19.44,"
"25/08/2014 6:00:05,1200,,,"


Is there a way to efficiently load this into a pandas dataframe object? Alternatively, also multiple pandas Series objects would do the job. So far I tried:

df = read_csv(path, header=None)
df[0].str[0:-1].str.split(',', return_type='frame')


The second line is an answer from this thread pandas split string into columns. However, I wonder if there is an even better way, especially since I have different data types? Secondly, how can I correctly parse the dates with
to_datetime()
. I tried to reindex
df
and used
df.index = df.index.to_datetime()
. This worked only half way because it did not strictly keep the
dd/mm/yyyy ...
format. Some dates were incorrectly parsed as
mm/dd/yyyy ...
. I'm looking for fast ways because eventually I will loop over many such csv's. Thx for any help!

EDIT: Ideally data in this form should be handled as well:

"XAU=,XAU=,XAG=,XAG="
"25/08/2014 6:00:05,1200.343,25/08/2014 6:00:03,19.44,"
",,25/08/2014 6:00:05,19.50,"

Answer

You can preprocess everything inside the read_csv as shown:

import csv

data = StringIO(
'''
"XAU=,XAU=,XAG=,XAG="  
"25/08/2014 6:00:05,1200.343,25/08/2014 6:00:03,19.44,"            
"25/08/2014 6:00:05,1200,,,"
''')

df = pd.read_csv(data, quoting=csv.QUOTE_NONE, index_col=False, escapechar='"',   \
                 parse_dates=[0, 2]).rename(columns=lambda x: x.split("=")[0])
df

Image

df.dtypes

XAU    datetime64[ns]
XAU           float64
XAG    datetime64[ns]
XAG           float64
dtype: object

Break-up:

quoting=csv.QUOTE_NONE : Instructs writer objects to never quote fields

index_col=False: Do not use the first column as the index

escapechar=": string used to escape delimiter

parse_dates=[0, 2]: convert columns(0,2 → order) to datetime objects


To read a subset of the columns, you can do so with the help of usecols by supplying appropriate integer indices as shown:

df = pd.read_csv(data, quoting=csv.QUOTE_NONE, index_col=False, escapechar='"',   \
                 parse_dates=[0], usecols=[0,1]).rename(columns=lambda x: x.split("=")[0])

df

Image

If you want to convert the two columns of XAU into a series object:

df.columns = df.columns + [str('_%d'%(i)) for i in list(range(len(df.columns)))]

ser = pd.Series(data=df['XAU_1'].values, index=df['XAU_0'].values, name='XAU')
ser

2014-08-25 06:00:05    1200.343
2014-08-25 06:00:05    1200.000
Name: XAU, dtype: float64

type(ser)
pandas.core.series.Series