user1917407 - 2 months ago 6x

Python Question

I'm working with a million-row CSV dataset that includes columns "latitude" and "longitude", and I want to create a new column based on that called "state", which is the US state that contains those coordinates.

```
import pandas as pd
import numpy as np
import os
from uszipcode import ZipcodeSearchEngine

def convert_to_state(coord):
    lat, lon = coord["latitude"], coord["longitude"]
    res = search.by_coordinate(lat, lon, radius=1, returns=1)
    state = res.State
    return state

def get_state(path):
    with open(path + "USA_downloads.csv", 'r+') as f:
        data = pd.read_csv(f)
        data["state"] = data.loc[:, ["latitude", "longitude"]].apply(convert_to_state, axis=1)

get_state(path)
```

I keep getting an error "DtypeWarning: Columns (4,5) have mixed types. Specify dtype option on import or set low_memory=False." Columns 4 and 5 correspond to the latitude and longitude. I don't understand how I would use .apply to complete this task, or if .apply is even the right method for the job. How should I proceed?

Answer

I believe this will be a faster implementation of your program:

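```
import pandas as pd
import numpy as np
import os
from uszipcode import ZipcodeSearchEngine

# The question uses `search` and `path` without defining them;
# both definitions below are assumptions added so the snippet is complete.
search = ZipcodeSearchEngine()
path = "./"  # assumed directory prefix for USA_downloads.csv

def convert_to_state(lat, lon):
    # Look up the closest ZIP code entry within a radius of 1 and read its state.
    res = search.by_coordinate(lat, lon, radius=1, returns=1)
    state = res.State
    return state

def get_state(path):
    with open(path + "USA_downloads.csv", 'r+') as f:
        data = pd.read_csv(f)
    # Call convert_to_state element-wise on the two coordinate arrays.
    data["state"] = np.vectorize(convert_to_state)(data["latitude"].values, data["longitude"].values)
    return data  # return the result so the new column is not discarded

data = get_state(path)
```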

It uses `numpy.vectorize` to speed things up a little (although it is still a loop), and then calls the function with the values obtained from the `'latitude'` and `'longitude'` columns of your DataFrame, converted to `numpy.ndarray` (the `.values` attribute does that).
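To see the calling pattern in isolation, here is a minimal sketch that swaps the uszipcode lookup for a hypothetical stand-in, `fake_state_lookup`, so it runs without the library or the CSV file. The function name, the sample coordinates, and the longitude rule are all made up for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd


def fake_state_lookup(lat, lon):
    # Hypothetical stand-in for convert_to_state: NOT a real geographic rule.
    return "west" if lon < -100 else "east"


# Small made-up sample in place of USA_downloads.csv.
data = pd.DataFrame({
    "latitude": [34.05, 40.71, 47.61],
    "longitude": [-118.24, -74.01, -122.33],
})

# np.vectorize wraps the scalar function so it accepts whole arrays.
# It still calls the Python function for each (lat, lon) pair.
# otypes=[object] keeps the returned strings as plain Python objects.
lookup = np.vectorize(fake_state_lookup, otypes=[object])
data["state"] = lookup(data["latitude"].values, data["longitude"].values)

print(data)
```

Because the wrapped function is still invoked for every pair of coordinates, any speed-up over `.apply()` is likely to come from skipping the per-row Series construction rather than from true vectorization.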

If you want to keep using `.apply()`, you can do:

```
state = data.apply(lambda x: convert_to_state(x['latitude'], x['longitude']), axis=1)
data["state"] = state
```
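The `DtypeWarning` in the question is raised by `pd.read_csv`, before either `.apply()` or `np.vectorize` runs. One possible way to deal with it, assuming the mixed types come from a few non-numeric entries in the coordinate columns, is sketched below. The inline CSV text and the `"unknown"` entry are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for USA_downloads.csv with one bad latitude entry.
csv_text = (
    "latitude,longitude\n"
    "34.05,-118.24\n"
    "unknown,-74.01\n"
    "47.61,-122.33\n"
)

# low_memory=False is the option the warning text itself suggests.
data = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Convert both columns to floats; entries that cannot be parsed become NaN.
for col in ["latitude", "longitude"]:
    data[col] = pd.to_numeric(data[col], errors="coerce")

# Drop rows without usable coordinates before doing any lookups.
clean = data.dropna(subset=["latitude", "longitude"]).reset_index(drop=True)

print(clean)
```

Whether dropping the unparseable rows is acceptable depends on the dataset; it may be worth inspecting them first.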

Source (Stackoverflow)
