CFM CFM - 9 days ago 5
Python Question

Apply function to every row in Pandas Dataframe

I am new to Python and would like to rebuild this example. I have longitude and latitude data about NYC Taxi pick-ups and drop-offs, however, I need to change the data to the Web Mercartor format (this cannot be found in the example above).
I found a function which can take one pair of longitude and latitude values and change it to Web Mercartor format, which was taken from here, it looks as follows:

import math
def toWGS84(xLon, yLat):
# Check if coordinate out of range for Latitude/Longitude
if (abs(xLon) < 180) and (abs(yLat) > 90):
return

# Check if coordinate out of range for Web Mercator
# 20037508.3427892 is full extent of Web Mercator
if (abs(xLon) > 20037508.3427892) or (abs(yLat) > 20037508.3427892):
return

semimajorAxis = 6378137.0 # WGS84 spheriod semimajor axis

latitude = (1.5707963267948966 - (2.0 * math.atan(math.exp((-1.0 * yLat) / semimajorAxis)))) * (180/math.pi)
longitude = ((xLon / semimajorAxis) * 57.295779513082323) - ((math.floor((((xLon / semimajorAxis) * 57.295779513082323) + 180.0) / 360.0)) * 360.0)

return [longitude, latitude]



def toWebMercator(xLon, yLat):
# Check if coordinate out of range for Latitude/Longitude
if (abs(xLon) > 180) and (abs(yLat) > 90):
return

semimajorAxis = 6378137.0 # WGS84 spheriod semimajor axis
east = xLon * 0.017453292519943295
north = yLat * 0.017453292519943295

northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
easting = semimajorAxis * east

return [easting, northing]

def main():
print(toWebMercator(-105.816001, 40.067633))
print(toWGS84(-11779383.349100526, 4875775.395628653))

if __name__ == '__main__':
main()


How do I apply this data to every pair of long/lat coordinates in my pandas Dataframe and save the output in the same pandasDF?

df.tail()
| longitude | latitude
____________|__________________|______________
11135465 | -73.986893 | 40.761093
1113546 | -73.979645 | 40.747814
11135467 | -74.001244 | 40.743172
11135468 | -73.997818 | 40.726055
...

Answer

With a dataset that size, what would help you the most is understanding how to do things the pandas way. Iterating over rows will yield terrible performance compared to the built in vectorized methods.

import pandas as pd
import numpy as np

df = pd.read_csv('/yellow_tripdata_2016-06.csv')
df.head(5)

VendorID    tpep_pickup_datetime    tpep_dropoff_datetime   passenger_count trip_distance   pickup_longitude    pickup_latitude RatecodeID  store_and_fwd_flag  dropoff_longitude   dropoff_latitude    payment_type    fare_amount extra   mta_tax tip_amount  tolls_amount    improvement_surcharge   total_amount
0   2   2016-06-09 21:06:36 2016-06-09 21:13:08 2   0.79    -73.983360  40.760937   1   N   -73.977463  40.753979   2   6.0 0.5 0.5 0.00    0.0 0.3 7.30
1   2   2016-06-09 21:06:36 2016-06-09 21:35:11 1   5.22    -73.981720  40.736668   1   N   -73.981636  40.670242   1   22.0    0.5 0.5 4.00    0.0 0.3 27.30
2   2   2016-06-09 21:06:36 2016-06-09 21:13:10 1   1.26    -73.994316  40.751072   1   N   -74.004234  40.742168   1   6.5 0.5 0.5 1.56    0.0 0.3 9.36
3   2   2016-06-09 21:06:36 2016-06-09 21:36:10 1   7.39    -73.982361  40.773891   1   N   -73.929466  40.851540   1   26.0    0.5 0.5 1.00    0.0 0.3 28.30
4   2   2016-06-09 21:06:36 2016-06-09 21:23:23 1   3.10    -73.987106  40.733173   1   N   -73.985909  40.766445   1   13.5    0.5 0.5 2.96    0.0 0.3 17.76

This dataset has 11,135,470 rows, which isn't "big data," but isn't small. Rather than writing a function and applying it to every row, you'll get a lot more performance by performing parts of the function to individual columns. I would turn this function:

def toWebMercator(xLon, yLat):
    # Check if coordinate out of range for Latitude/Longitude
    if (abs(xLon) > 180) and (abs(yLat) > 90):
        return

    semimajorAxis = 6378137.0  # WGS84 spheriod semimajor axis
    east = xLon * 0.017453292519943295
    north = yLat * 0.017453292519943295

    northing = 3189068.5 * math.log((1.0 + math.sin(north)) / (1.0 - math.sin(north)))
    easting = semimajorAxis * east

    return [easting, northing]

into this:

SEMIMAJORAXIS = 6378137.0 # typed in all caps since this is a static value
df['pickup_east'] = df['pickup_longitude'] * 0.017453292519943295 # takes all pickup longitude values, multiples them, then saves as a new column named pickup_east.
df['pickup_north'] = df['pickup_latitude'] * 0.017453292519943295
# numpy functions allow you to calculate an entire column's worth of values by simply passing in the column. 
df['pickup_northing'] = 3189068.5 * np.log((1.0 + np.sin(df['pickup_north'])) / (1.0 - np.sin(df['pickup_north']))) 
df['pickup_easting'] = SEMIMAJORAXIS * df['pickup_east']

You then have pickup_easting and pickup_northing columns with the calculated values.

For my laptop, this takes:

CPU times: user 1.01 s, sys: 286 ms, total: 1.3 s
Wall time: 763 ms

For all 11m rows. 15 minutes --> seconds.

I got rid of the checks on the values- you could do something like:

df = df[(df['pickup_longitude'].abs() <= 180) & (df['pickup_latitude'].abs() <= 90)]

This uses boolean indexing, which again, is orders of magnitude faster than looping.