CFM - 1 year ago 169
Python Question

# Apply function to every row in Pandas Dataframe

I am new to Python and would like to rebuild this example. I have longitude and latitude data about NYC Taxi pick-ups and drop-offs, however, I need to change the data to the Web Mercartor format (this cannot be found in the example above).
I found a function which can take one pair of longitude and latitude values and change it to Web Mercartor format, which was taken from here, it looks as follows:

``````import math
def toWGS84(xLon, yLat):
# Check if coordinate out of range for Latitude/Longitude
if (abs(xLon) < 180) and (abs(yLat) > 90):
return

# Check if coordinate out of range for Web Mercator
# 20037508.3427892 is full extent of Web Mercator
if (abs(xLon) > 20037508.3427892) or (abs(yLat) > 20037508.3427892):
return

semimajorAxis = 6378137.0  # WGS84 spheriod semimajor axis

latitude = (1.5707963267948966 - (2.0 * math.atan(math.exp((-1.0 * yLat) / semimajorAxis)))) * (180/math.pi)
longitude = ((xLon / semimajorAxis) * 57.295779513082323) - ((math.floor((((xLon / semimajorAxis) * 57.295779513082323) + 180.0) / 360.0)) * 360.0)

return [longitude, latitude]

def toWebMercator(xLon, yLat):
# Check if coordinate out of range for Latitude/Longitude
if (abs(xLon) > 180) and (abs(yLat) > 90):
return

semimajorAxis = 6378137.0  # WGS84 spheriod semimajor axis
east = xLon * 0.017453292519943295
north = yLat * 0.017453292519943295

northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
easting = semimajorAxis * east

return [easting, northing]

def main():
print(toWebMercator(-105.816001, 40.067633))
print(toWGS84(-11779383.349100526, 4875775.395628653))

if __name__ == '__main__':
main()
``````

How do I apply this data to every pair of long/lat coordinates in my pandas Dataframe and save the output in the same pandasDF?

``````df.tail()
|    longitude     |    latitude
____________|__________________|______________
11135465    |    -73.986893    |    40.761093
1113546     |    -73.979645    |    40.747814
11135467    |    -74.001244    |    40.743172
11135468    |    -73.997818    |    40.726055
...
``````

With a dataset that size, what would help you the most is understanding how to do things the `pandas` way. Iterating over rows will yield terrible performance compared to the built in vectorized methods.

``````import pandas as pd
import numpy as np

VendorID    tpep_pickup_datetime    tpep_dropoff_datetime   passenger_count trip_distance   pickup_longitude    pickup_latitude RatecodeID  store_and_fwd_flag  dropoff_longitude   dropoff_latitude    payment_type    fare_amount extra   mta_tax tip_amount  tolls_amount    improvement_surcharge   total_amount
0   2   2016-06-09 21:06:36 2016-06-09 21:13:08 2   0.79    -73.983360  40.760937   1   N   -73.977463  40.753979   2   6.0 0.5 0.5 0.00    0.0 0.3 7.30
1   2   2016-06-09 21:06:36 2016-06-09 21:35:11 1   5.22    -73.981720  40.736668   1   N   -73.981636  40.670242   1   22.0    0.5 0.5 4.00    0.0 0.3 27.30
2   2   2016-06-09 21:06:36 2016-06-09 21:13:10 1   1.26    -73.994316  40.751072   1   N   -74.004234  40.742168   1   6.5 0.5 0.5 1.56    0.0 0.3 9.36
3   2   2016-06-09 21:06:36 2016-06-09 21:36:10 1   7.39    -73.982361  40.773891   1   N   -73.929466  40.851540   1   26.0    0.5 0.5 1.00    0.0 0.3 28.30
4   2   2016-06-09 21:06:36 2016-06-09 21:23:23 1   3.10    -73.987106  40.733173   1   N   -73.985909  40.766445   1   13.5    0.5 0.5 2.96    0.0 0.3 17.76
``````

This dataset has 11,135,470 rows, which isn't "big data," but isn't small. Rather than writing a function and applying it to every row, you'll get a lot more performance by performing parts of the function to individual columns. I would turn this function:

``````def toWebMercator(xLon, yLat):
# Check if coordinate out of range for Latitude/Longitude
if (abs(xLon) > 180) and (abs(yLat) > 90):
return

semimajorAxis = 6378137.0  # WGS84 spheriod semimajor axis
east = xLon * 0.017453292519943295
north = yLat * 0.017453292519943295

northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
easting = semimajorAxis * east

return [easting, northing]
``````

into this:

``````SEMIMAJORAXIS = 6378137.0 # typed in all caps since this is a static value
df['pickup_east'] = df['pickup_longitude'] * 0.017453292519943295 # takes all pickup longitude values, multiples them, then saves as a new column named pickup_east.
df['pickup_north'] = df['pickup_latitude'] * 0.017453292519943295
# numpy functions allow you to calculate an entire column's worth of values by simply passing in the column.
df['pickup_northing'] = 3189068.5 * np.log((1.0 + np.sin(df['pickup_north'])) / (1.0 - np.sin(df['pickup_north'])))
df['pickup_easting'] = SEMIMAJORAXIS * df['pickup_east']
``````

You then have `pickup_easting` and `pickup_northing` columns with the calculated values.

For my laptop, this takes:

``````CPU times: user 1.01 s, sys: 286 ms, total: 1.3 s
Wall time: 763 ms
``````

For all 11m rows. 15 minutes --> seconds.

I got rid of the checks on the values- you could do something like:

``````df = df[(df['pickup_longitude'].abs() <= 180) & (df['pickup_latitude'].abs() <= 90)]
``````

This uses boolean indexing, which again, is orders of magnitude faster than looping.