lnNoam lnNoam - 1 year ago 138
Python Question

Speed up pandas dataframe lookup

I have a pandas dataframe with zip codes, city, state and country for ~600,000 locations. Let's call it my_df.

I'd like to look up the corresponding longitude and latitude for each of these locations. Thankfully, there is a database for this. Let's call this dataframe zipdb.

zipdb has, among others, columns for zip code, city, state and country.
So, I'd like to look up all of the locations (zip, city, state and country) in zipdb.

    def zipdb_lookup(zipcode, city, state, country):

        countries_mapping = { "UNITED STATES": "US"
                            , "ITALY": "IT"
                            , "CHILE": "CL" }

        slc = zipdb[ (zipdb.Zipcode == str(zipcode)) &
                     (zipdb.City == str(city).upper()) &
                     (zipdb.State == str(state).upper()) &
                     (zipdb.Country == countries_mapping[country].upper()) ]

        if slc.shape[0] == 1:
            return np.array(slc["Lat"])[0], np.array(slc["Long"])[0]
        return None

I have tried pandas' .apply() as well as a plain for loop to do this.
Both are very slow. I recognize there are a large number of rows, but I can't help but think something faster must be possible.

    zipdb = pandas.read_csv("free-zipcode-database.csv") # linked to above

Note: I've also performed this transformation on zipdb:

    zipdb["Zipcode"] = zipdb["Zipcode"].astype(str)
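One caveat worth flagging here (my addition, not from the original post): if the CSV stores zip codes as integers, astype(str) alone loses the leading zeros that East Coast zips have, so the string comparison in zipdb_lookup can silently miss. A small sketch with a made-up frame, padding back to five digits with str.zfill:

```python
import pandas as pd

# Hypothetical stand-in for zipdb; "Zipcode" read as int has dropped a leading zero.
zipdb = pd.DataFrame({"Zipcode": [2115, 60208]})

# astype(str) alone would give "2115"; zfill(5) pads it back to "02115".
zipdb["Zipcode"] = zipdb["Zipcode"].astype(str).str.zfill(5)
print(zipdb["Zipcode"].tolist())  # ['02115', '60208']
```

The same padding would need to be applied to my_df's zip column so both sides compare equal.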

Function Call:

    # Defined a wrapper function:
    def lookup(row):
        """
        :param row: integer row index into my_df
        """
        lnglat = zipdb_lookup(
              zipcode = my_df["organization_zip"][row]
            , city    = my_df["organization_city"][row]
            , state   = my_df["organization_state"][row]
            , country = my_df["organization_country"][row]
        )
        return lnglat

    lnglat = list()
    for l in range(0, my_df.shape[0]):
        # if l % 5000 == 0: print(round((float(l) / my_df.shape[0])*100, 2), "%")
        lnglat.append(lookup(row = l))
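For what it's worth, the explicit loop above can also be written with DataFrame.apply(axis=1) (a toy sketch with made-up data; apply passes each row in as a Series, so the wrapper indexes by column name rather than by integer position). This is no faster, though, since the Python-level lookup still runs once per row:

```python
import pandas as pd

# Toy stand-in for my_df, just to show the pattern; the column name matches the post.
my_df = pd.DataFrame({"organization_zip": ["60208", "77555"]})

def lookup(row):
    # apply(axis=1) hands each row over as a Series, indexed by column name
    return "zip=" + row["organization_zip"]

lnglat = my_df.apply(lookup, axis=1).tolist()
print(lnglat)  # ['zip=60208', 'zip=77555']
```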

Sample data from my_df:

organization_zip organization_city organization_state organization_country
0 60208 EVANSTON IL United Sates
1 77555 GALVESTON TX United Sates
2 23284 RICHMOND VA United Sates
3 53233 MILWAUKEE WI United Sates
4 10036 NEW YORK NY United Sates
5 33620 TAMPA FL United Sates
6 10029 NEW YORK NY United Sates
7 97201 PORTLAND OR United Sates
8 97201 PORTLAND OR United Sates
9 53715 MADISON WI United Sates

Answer

Using merge() will be a lot faster than calling a function on every row. Make sure the field types match and strings are stripped:

    # prepare your dataframe
    data['organization_zip'] = data.organization_zip.astype(str)
    data['organization_city'] = data.organization_city.apply(lambda v: v.strip())
    # get the zips database
    zips = pd.read_csv('/path/to/free-zipcode-database.csv')
    zips['Zipcode'] = zips.Zipcode.astype(str)
    # left join
    # -- prepare common join columns so both frames share names
    zips['organization_zip'] = zips.Zipcode
    zips['organization_city'] = zips.City.apply(lambda v: v.strip())
    # specify join columns along with zips' columns to copy
    cols = ['organization_zip', 'organization_city', 'Lat', 'Long']
    data = data.merge(zips[cols], how='left')


Note you may need to extend the merge columns and/or add more columns to copy from the zips dataframe.
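To make the join behavior concrete, here is a minimal sketch with made-up frames (the column names mirror the answer's; the Lat/Long values are invented). With no `on=` given, merge joins on the columns the two frames share, and unmatched left rows keep NaN in the copied columns, so misses are easy to spot afterwards:

```python
import pandas as pd

# Invented data frame with one zip that won't match anything.
data = pd.DataFrame({
    "organization_zip":  ["60208", "77555", "99999"],
    "organization_city": ["EVANSTON", "GALVESTON", "NOWHERE"],
})
# Invented zips database already renamed to share the join columns.
zips = pd.DataFrame({
    "organization_zip":  ["60208", "77555"],
    "organization_city": ["EVANSTON", "GALVESTON"],
    "Lat":  [42.06, 29.31],
    "Long": [-87.68, -94.78],
})

cols = ["organization_zip", "organization_city", "Lat", "Long"]
# Joins on the shared columns; the third row has no match, so Lat/Long stay NaN.
merged = data.merge(zips[cols], how="left")
print(merged)
```

One merge over the whole frame replaces 600,000 per-row scans of zipdb, which is where the speedup comes from.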