lnNoam lnNoam - 4 months ago 43
Python Question

Speed up pandas dataframe lookup

I have a pandas data frame with zip codes, city, state and country of ~ 600,000 locations. Let's call it my_df

I'd like to look up the corresponding longitude and latitude for each of these locations. Thankfully, there is a database for this. Let's call this dataframe zipdb.

zipdb
has, among others, columns for zip codes, city, state and country.
So, I'd like to look up all of the locations (zip, city, state and country) in
zipdb
.

def zipdb_lookup(zipcode, city, state, country):

countries_mapping = { "UNITED STATES":"US"
, "CANADA":"CA"
, "KOREA REP OF":"KR"
, "ITALY":"IT"
, "AUSTRALIA":"AU"
, "CHILE":"CL"
, "UNITED KINGDOM":"GB"
, "BERMUDA":"BM"
}

try:
slc = zipdb[ (zipdb.Zipcode == str(zipcode)) &
(zipdb.City == str(city).upper()) &
(zipdb.State == str(state).upper()) &
(zipdb.Country == countries_mapping[country].upper()) ]

if slc.shape[0] == 1:
return np.array(slc["Lat"])[0], np.array(slc["Long"])[0]
else:
return None
except:
return None


I have tried pandas'
.apply
as well as a
for
loop to do this.
Both are very slow. I recognize there are a large number of rows, but I can't help but think something faster must be possible.




zipdb = pandas.read_csv("free-zipcode-database.csv") #linked to above


Note: I've also performed this transformation on
zibdb
:

zipdb["Zipcode"] = zipdb["Zipcode"].astype(str)





Function Call:

#Defined a wrapper function:
def lookup(row):
"""

:param row:
:return:
"""

lnglat = zipdb_lookup(
zipcode = my_df["organization_zip"][row]
, city = my_df["organization_city"][row]
, state = my_df["organization_state"][row]
, country = my_df["organization_country"][row]
)

return lnglat

lnglat = list()
for l in range(0, my_df.shape[0]):
# if l % 5000 == 0: print(round((float(l) / my_df.shape[0])*100, 2), "%")
lnglat.append(lookup(row = l))





Sample Data from
my_df
:

organization_zip organization_city organization_state organization_country
0 60208 EVANSTON IL United Sates
1 77555 GALVESTON TX United Sates
2 23284 RICHMOND VA United Sates
3 53233 MILWAUKEE WI United Sates
4 10036 NEW YORK NY United Sates
5 33620 TAMPA FL United Sates
6 10029 NEW YORK NY United Sates
7 97201 PORTLAND OR United Sates
8 97201 PORTLAND OR United Sates
9 53715 MADISON WI United Sates

Answer

Using merge() will be a lot faster than calling a function on every row. Make sure the field types match and strings are stripped:

# prepare your dataframe
data['organization_zip'] = data.organization_zip.astype(str)
data['organization_city'] = data.organization_city.apply(lambda v: v.strip())
# get the zips database
zips = pd.read_csv('/path/to/free-zipcode-database.csv')
zips['Zipcode'] = zips.Zipcode.astype(str)
# left join
# -- prepare common join columns
zips.rename(columns=dict(Zipcode='organization_zip',
                         City='organization_city'), 
            inplace=True)  
# specify join columns along with zips' columns to copy
cols = ['organization_zip', 'organization_city', 'Lat', 'Long']
data.merge(zips[cols], how='left')
=> 

result

Note you may need to extend the merge columns and/or add more columns to copy from the zips dataframe.

Comments