sparrow sparrow - 3 months ago 7
Python Question

Set datatype after converting null values while reading from csv to DataFrame with Pandas

I have a .csv file with GPS data which looks like this:

ID,GPS_LATITUDE,GPS_LONGITUDE
1,35.66727683,139.7591279
2,35.66727683,139.7591279
3,-1,-1
4,35.66750697,139.7589757
5,,139.7589757


The last row has a blank ("null") latitude. I would like to read the data into a DataFrame, replace the null value with -1, and have both GPS columns read as type float. With my code the columns come back as strings and the null value is not replaced.

How I'm trying to do it (wrong):

data = r'c:\temp\gps.csv'

def conv(val):
    if val == np.nan:
        return -1
    return val

df = pd.read_csv(data,converters={'GPS_LATITUDE':conv,'GPS_LONGITUDE':conv},dtype={'GPS_LATITUDE':np.float64,'GPS_LONGITUDE':np.float64})


Code to test output:

lats = df['GPS_LATITUDE'].tolist()
for l in lats:
    print(l,type(l))
df


Output:

35.66727683 <class 'str'>
35.66727683 <class 'str'>
-1 <class 'str'>
35.66750697 <class 'str'>
<class 'str'>
Out[63]:
   ID GPS_LATITUDE GPS_LONGITUDE
0   1  35.66727683   139.7591279
1   2  35.66727683   139.7591279
2   3           -1            -1
3   4  35.66750697   139.7589757
4   5                139.7589757

Answer

First of all, you don't need a conv function at all. With a float dtype, read_csv parses the empty field as NaN, and fillna(-1) then replaces it:

$ cat /tmp/a.csv
ID,GPS_LATITUDE,GPS_LONGITUDE
1,35.66727683,139.7591279
2,35.66727683,139.7591279
3,-1,-1
4,35.66750697,139.7589757
5,,139.7589757

In [15]: df = pd.read_csv("/tmp/a.csv", dtype={'GPS_LATITUDE':np.float64,'GPS_LONGITUDE':np.float64})

In [16]: df
Out[16]: 
   ID  GPS_LATITUDE  GPS_LONGITUDE
0   1     35.667277     139.759128
1   2     35.667277     139.759128
2   3     -1.000000      -1.000000
3   4     35.667507     139.758976
4   5           NaN     139.758976

In [18]: df.dtypes
Out[18]: 
ID                 int64
GPS_LATITUDE     float64
GPS_LONGITUDE    float64
dtype: object

In [19]: df.fillna(-1, inplace = True)

In [20]: df
Out[20]: 
   ID  GPS_LATITUDE  GPS_LONGITUDE
0   1     35.667277     139.759128
1   2     35.667277     139.759128
2   3     -1.000000      -1.000000
3   4     35.667507     139.758976
4   5     -1.000000     139.758976
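
The session above can be condensed into a small script. This is a minimal sketch, not the only way to do it: it reads the sample data from an in-memory string instead of /tmp/a.csv so it runs anywhere, and the names CSV_TEXT and GPS_COLUMNS are made up for illustration. It also limits fillna to the two GPS columns and assigns the result back rather than using inplace=True, which is a design choice of this sketch and not part of the original answer.

```python
import io

import numpy as np
import pandas as pd

# Inline copy of the sample file from the question, so no path is needed.
CSV_TEXT = """ID,GPS_LATITUDE,GPS_LONGITUDE
1,35.66727683,139.7591279
2,35.66727683,139.7591279
3,-1,-1
4,35.66750697,139.7589757
5,,139.7589757
"""

GPS_COLUMNS = ["GPS_LATITUDE", "GPS_LONGITUDE"]  # hypothetical helper name

# Empty fields are parsed as NaN, which fits in a float64 column.
df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    dtype={col: np.float64 for col in GPS_COLUMNS},
)

# Replace NaN with -1 only in the GPS columns and assign the result back.
df[GPS_COLUMNS] = df[GPS_COLUMNS].fillna(-1)

print(df.dtypes)
print(df)
```

Restricting fillna to named columns avoids silently writing -1 into other columns that happen to contain missing values.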

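Second, if you do want to use conv, change it as shown here. The converter receives each field as a string, so an empty field arrives as '' and never equals np.nan; test for the empty string and convert to float yourself. Columns that have a converter do not need a dtype entry, since the converter is applied instead of the dtype: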

In [21]: def conv(val):
   ....:     if not val:
   ....:         return -1
   ....:     return np.float64(val)
   ....: 

In [22]: df = pd.read_csv("/tmp/a.csv", converters={'GPS_LATITUDE':conv,'GPS_LONGITUDE':conv})

In [23]: df
Out[23]: 
   ID  GPS_LATITUDE  GPS_LONGITUDE
0   1     35.667277     139.759128
1   2     35.667277     139.759128
2   3     -1.000000      -1.000000
3   4     35.667507     139.758976
4   5     -1.000000     139.758976

In [24]: df.dtypes
Out[24]: 
ID                 int64
GPS_LATITUDE     float64
GPS_LONGITUDE    float64
dtype: object
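
To see why the original conv never fired, it can help to record what read_csv actually hands to the converter. The following is a rough sketch under a few assumptions: the data is a shortened inline copy of the sample file, the seen list is a hypothetical debugging aid, and the converter returns plain Python floats rather than np.float64.

```python
import io

import numpy as np
import pandas as pd

# Two rows from the sample file are enough to show the behaviour.
CSV_TEXT = """ID,GPS_LATITUDE,GPS_LONGITUDE
3,-1,-1
5,,139.7589757
"""

seen = []  # records the raw values passed to the converter


def conv(val):
    seen.append(val)
    # An empty field arrives as the empty string '', which is falsy.
    if not val:
        return -1.0
    return float(val)


df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    converters={"GPS_LATITUDE": conv, "GPS_LONGITUDE": conv},
)

print(seen)              # raw field values as received by conv
print(np.nan == np.nan)  # False: NaN never compares equal, even to itself
print(df.dtypes)
```

Because the converter only ever sees strings, a comparison with np.nan cannot match, and even a real NaN would fail an == test; np.isnan or pd.isna is the usual check for values that are already floats.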

In either case:

In [26]: lats = df['GPS_LATITUDE'].tolist()

In [27]: for l in lats:
   ....:     print(l,type(l))
   ....:     
(35.667276829999999, <type 'numpy.float64'>)
(35.667276829999999, <type 'numpy.float64'>)
(-1.0, <type 'numpy.float64'>)
(35.667506969999998, <type 'numpy.float64'>)
(-1.0, <type 'numpy.float64'>)
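
The exact type printed by the loop above may differ between Python and pandas versions, so a version-independent check can be more convenient. A minimal sketch, assuming a shortened inline copy of the sample data and the hypothetical variable names lat_is_float and all_floats:

```python
import io

import pandas as pd

CSV_TEXT = """ID,GPS_LATITUDE,GPS_LONGITUDE
1,35.66727683,139.7591279
5,,139.7589757
"""

# Let pandas infer the column types, then replace NaN with -1.
df = pd.read_csv(io.StringIO(CSV_TEXT)).fillna(-1)

# Column-level check: one dtype per column, no loop needed.
lat_is_float = pd.api.types.is_float_dtype(df["GPS_LATITUDE"])
print(lat_is_float)

# Element-level check: numpy.float64 subclasses Python's float, so this
# holds whether tolist() yields numpy scalars or plain Python floats.
all_floats = all(isinstance(v, float) for v in df["GPS_LATITUDE"].tolist())
print(all_floats)
```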