abalter abalter - 2 months ago 28
Python Question

Dealing with missing data in Pandas read_csv

I have not found a satisfying solution to the problem of missing data when importing CSV data into a pandas DataFrame.

I have datasets where I don't know in advance what the columns or data types are. I would like pandas to do a better job inferring how to read in the data.

I haven't found any combination of na_values=... that really helps.

Consider the following csv files:

no_holes.csv

letter,number
a,1
b,2
c,3
d,4


with_holes.csv

letter,number
a,1
,2
b,
,4


empty_column.csv

letters,numbers
,1
,2
,3
,4


with_NA.csv

letter,number
a,1
b,NA
NA,3
d,4


Here is what happens when I read them into a DataFrame (code below):

**no holes**
  letter  number
0      a       1
1      b       2
2      c       3
3      d       4
letter    object
number     int64
dtype: object

**with holes**
  letter number
0      a      1
1    NaN      2
2      b
3    NaN      4
letter    object
number    object
dtype: object

**empty_column**
   letters  numbers
0      NaN        1
1      NaN        2
2      NaN        3
3      NaN        4
letters    float64
numbers      int64
dtype: object

**with NA**
  letter  number
0      a     1.0
1      b     NaN
2    NaN     3.0
3      d     4.0
letter     object
number    float64
dtype: object


Is there a way to tell pandas to assume empty values are of object type? I've tried na_values=[""].

demo_holes.py

import pandas as pd

with_holes = pd.read_csv("with_holes.csv")
no_holes = pd.read_csv("no_holes.csv")
empty_column = pd.read_csv("empty_column.csv")
with_NA = pd.read_csv("with_NA.csv")

print("\n**no holes**")
print(no_holes.head())
print(no_holes.dtypes)
print("\n**with holes**")
print(with_holes.head())
print(with_holes.dtypes)
print("\n**empty_column**")
print(empty_column.head())
print(empty_column.dtypes)
print("\n**with NA**")
print(with_NA.head())
print(with_NA.dtypes)

Answer

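You want to use the parameter skipinitialspace=True. It skips spaces after the delimiter, so a field holding only a space (the "b, " row in with_holes below) is read as empty and becomes NaN instead of turning the whole column into strings.

Setup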

no_holes = """letter,number
a,1
b,2
c,3
d,4"""

with_holes = """letter,number
a,1
,2
b, 
,4"""

empty_column = """letters,numbers
,1
,2
,3
,4"""

with_NA = """letter,number
a,1
b,NA
NA,3
d,4"""

from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
import pandas as pd

d1 = pd.read_csv(StringIO(no_holes), skipinitialspace=True)
d2 = pd.read_csv(StringIO(with_holes), skipinitialspace=True)
d3 = pd.read_csv(StringIO(empty_column), skipinitialspace=True)
d4 = pd.read_csv(StringIO(with_NA), skipinitialspace=True)

pd.concat([d1, d2, d3, d4], axis=1,
          keys=['no_holes', 'with_holes',
                'empty_column', 'with_NA'])

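To check the effect without the rendered table, a minimal sketch along these lines compares the inferred dtypes with and without skipinitialspace. It reuses the with_holes text from the setup; the names plain and stripped are only illustrative, and the import assumes Python 3.

```python
from io import StringIO  # assumes Python 3

import pandas as pd

# Same text as with_holes in the setup; note the lone space after "b,".
with_holes = "letter,number\na,1\n,2\nb, \n,4"

# Default parsing: the " " field is kept as a string, so the column is not numeric.
plain = pd.read_csv(StringIO(with_holes))

# skipinitialspace=True drops spaces after the delimiter, so the field is empty and becomes NaN.
stripped = pd.read_csv(StringIO(with_holes), skipinitialspace=True)

print(plain.dtypes)
print(stripped.dtypes)
```

With the space skipped, the number column should come back as a float column with one NaN rather than as strings.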


If you want those NaNs to be empty strings (''), use fillna:

d1 = pd.read_csv(StringIO(no_holes), skipinitialspace=True).fillna('')
d2 = pd.read_csv(StringIO(with_holes), skipinitialspace=True).fillna('')
d3 = pd.read_csv(StringIO(empty_column), skipinitialspace=True).fillna('')
d4 = pd.read_csv(StringIO(with_NA), skipinitialspace=True).fillna('')

pd.concat([d1, d2, d3, d4], axis=1,
          keys=['no_holes', 'with_holes',
                'empty_column', 'with_NA'])

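One caveat with a blanket fillna(''): a numeric column cannot hold '' and remain numeric. If that matters, a possible variation is to fill only the columns that were not inferred as numeric. This is a sketch rather than part of the original answer; text_cols is an illustrative name and Python 3 is assumed.

```python
from io import StringIO  # assumes Python 3

import pandas as pd

# Same text as with_NA in the setup.
with_NA = "letter,number\na,1\nb,NA\nNA,3\nd,4"

df = pd.read_csv(StringIO(with_NA), skipinitialspace=True)

# Pick the columns that were not inferred as numeric (here: "letter").
text_cols = df.select_dtypes(exclude="number").columns

# Fill missing values only in those columns; "number" keeps its NaN and float dtype.
df[text_cols] = df[text_cols].fillna("")

print(df)
print(df.dtypes)
```

Note that a column that is entirely empty, like letters in empty_column, is inferred as float64 and would be left untouched by this approach.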