CraigO - 2 years ago
Python Question

Using numpy.genfromtxt to read a csv file with strings containing commas

I am trying to read in a csv file with numpy.genfromtxt, but some of the fields are strings which contain commas. The strings are in quotes, but numpy does not treat the quotes as delimiting a single string. For example, with the data in 't.csv':

2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0

the code

np.genfromtxt('t.csv', delimiter=',')

produces the error:

ValueError: Some errors were detected !
Line #2 (got 4 columns instead of 3)

The data structure I am looking for is:

array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']])

Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?

Answer

You can use pandas for this (it is becoming the default library for working with dataframes, i.e. heterogeneous data, in scientific Python). Its read_csv function handles quoted fields. From the docs:

quotechar : string

The character used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.

The default value is ". An example:

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: s="""year, city, value
   ...: 2012, "Louisville KY", 3.5
   ...: 2011, "Lexington, KY", 4.0"""

In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0

The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma-delimiter.
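To illustrate why the space after the comma matters, here is a minimal sketch using the standard-library csv module, which has the same skipinitialspace option. It is not part of the original answer; the variable names are made up for illustration, and the data is the sample from the question without a header row.

```python
import csv
from io import StringIO

# Sample data from the question (no header row).
data = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# Without skipinitialspace, the space after each comma starts an unquoted
# field, so the quote characters are treated as ordinary text and the
# comma inside "Lexington, KY" splits the field.
naive_rows = list(csv.reader(StringIO(data)))

# With skipinitialspace=True the leading space is dropped, so the quote
# opens a quoted field and the embedded comma is preserved.
rows = list(csv.reader(StringIO(data), skipinitialspace=True))

print(len(naive_rows[1]))  # number of fields found on the second line
print(rows)
```

If a plain numpy array of strings is all that is needed, passing rows to np.array should give the two-dimensional layout asked for in the question.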

Apart from its powerful csv reader, I also strongly advise using pandas for the heterogeneous data you have (the example output you give in numpy is all strings, although you could use structured arrays).
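If a numpy array is still wanted at the end, the DataFrame can be converted. The following is a sketch rather than the only way to do it, assuming a reasonably recent pandas (to_numpy was added in version 0.24); the names as_strings and records are hypothetical.

```python
from io import StringIO

import pandas as pd

s = '''year, city, value
2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0'''

df = pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)

# All-string 2-D array, matching the layout requested in the question.
as_strings = df.astype(str).to_numpy()

# Record array that keeps a separate dtype per column
# (integer year, object city, float value).
records = df.to_records(index=False)

print(as_strings)
print(records['city'])
```

The record array keeps the numeric columns numeric, which is usually more convenient than converting everything to strings.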
