Ripster - 13 days ago
Python Question

Pandas read_csv ignoring column dtypes when I pass skip_footer arg

When I import a CSV file into a DataFrame, pandas (0.13.1) ignores the dtype parameter. Is there a way to stop pandas from inferring the data types on its own?

I am merging several CSV files. Sometimes the CUSTOMER column contains letters, so pandas imports it as strings; when it contains only digits, pandas imports it as integers. Merging the two DataFrames then fails because the key columns have different types. I need everything stored as strings.

Data snippet:

|WAREHOUSE|ERROR|CUSTOMER|ORDER NO|
|---------|-----|--------|--------|
|3615 | |03106 |253734 |
|3615 | |03156 |290550 |
|3615 | |03175 |262207 |
|3615 | |03175 |262207 |
|3615 | |03175 |262207 |
|3615 | |03175 |262207 |
|3615 | |03175 |262207 |
|3615 | |03175 |262207 |
|3615 | |03175 |262207 |


Import line:

df = pd.read_csv("SomeFile.csv",
                 header=1,
                 skip_footer=1,
                 usecols=[2, 3],
                 dtype={'ORDER NO': str, 'CUSTOMER': str})


df.dtypes outputs this:

ORDER NO int64
CUSTOMER int64
dtype: object

Answer

pandas 0.13.1 silently ignored the dtype argument because the C engine does not support skip_footer. pandas therefore fell back to the Python engine, which in that version does not support dtype.
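
Because the fallback is silent, it can help to check the result instead of trusting the argument. The following is a minimal sketch of such a check; `non_string_columns` is a hypothetical helper (not part of pandas), and the two small frames are made-up stand-ins for a file parsed as integers and one kept as text.

```python
import pandas as pd


def non_string_columns(df, columns):
    """Return the names in `columns` whose values are not all Python str.

    Hypothetical helper for illustration; it is not a pandas function.
    """
    return [
        col for col in columns
        if not all(isinstance(value, str) for value in df[col])
    ]


# Made-up frames: one where the parser inferred integers, one kept as text.
parsed_as_int = pd.DataFrame(
    {"CUSTOMER": [3106, 3156], "ORDER NO": [253734, 290550]}
)
kept_as_text = pd.DataFrame(
    {"CUSTOMER": ["03106", "03156"], "ORDER NO": ["253734", "290550"]}
)

bad_int = non_string_columns(parsed_as_int, ["CUSTOMER", "ORDER NO"])
bad_text = non_string_columns(kept_as_text, ["CUSTOMER", "ORDER NO"])
print(bad_int)   # ['CUSTOMER', 'ORDER NO']
print(bad_text)  # []
```

Running a check like this right after each read_csv call would surface the ignored dtype before the merge fails.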

Solution? Use converters and request the Python engine explicitly:

df = pd.read_csv('SomeFile.csv', 
                 header=1,
                 skip_footer=1, 
                 usecols=[2, 3], 
                 converters={'CUSTOMER': str, 'ORDER NO': str},
                 engine='python')

Output:

In [1]: df.dtypes
Out[1]:
CUSTOMER    object
ORDER NO    object
dtype: object

In [2]: type(df['CUSTOMER'][0])
Out[2]: str

In [3]: df.head()
Out[3]:
  CUSTOMER ORDER NO
0    03106   253734
1    03156   290550
2    03175   262207
3    03175   262207
4    03175   262207

Leading zeros from the original file are preserved, and all data is stored as strings.
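
For anyone who wants to try the converters approach without the original file, here is a self-contained sketch. The CSV text is invented to resemble the data snippet (including one customer code with a letter), and it assumes a newer pandas release, where the argument is spelled `skipfooter` rather than `skip_footer`; columns are selected by name instead of position to keep the example easy to read.

```python
import io

import pandas as pd

# Invented stand-in for SomeFile.csv: a title row, the header row,
# three records, and a footer line that should be dropped.
SAMPLE = """Daily report
WAREHOUSE,ERROR,CUSTOMER,ORDER NO
3615,,03106,253734
3615,,03156,290550
3615,,A3175,262207
Total rows: 3
"""

df = pd.read_csv(
    io.StringIO(SAMPLE),
    header=1,                # the second line holds the column names
    skipfooter=1,            # spelled skip_footer in pandas 0.13.1
    usecols=["CUSTOMER", "ORDER NO"],
    converters={"CUSTOMER": str, "ORDER NO": str},
    engine="python",         # footer skipping needs the Python engine
)

customers = df["CUSTOMER"].tolist()
orders = df["ORDER NO"].tolist()
print(customers)
print(orders)
```

Each converter receives the raw field text, so passing `str` simply keeps it as text, which is why the leading zeros survive.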