amormachine amormachine - 3 months ago 34
R Question

Purrr-Fection: In Search of An Elegant Solution to Conditional Data Frame Operations Leveraging Purrr

The Background

I have an issue for which a number of solution pathways are possible, but I am convinced there is an as-yet-undiscovered elegant solution leveraging purrr.

The Example Code

I have a large data frame as follows, for which I have included an example below:

library(tibble)
library(ggmap)
library(purrr)
library(dplyr)

# Define Example Data
df <- frame_data(
~Street, ~City, ~State, ~Zip, ~lon, ~lat,
"226 W 46th St", "New York", "New York", 10036, -73.9867, 40.75902,
"5th Ave", "New York", "New York", 10022, NA, NA,
"75 Broadway", "New York", "New York", 10006, -74.01205, 40.70814,
"350 5th Ave", "New York", "New York", 10118, -73.98566, 40.74871,
"20 Sagamore Hill Rd", "Oyster Bay", "New York", 11771, NA, NA,
"45 Rockefeller Plaza", "New York", "New York", 10111, -73.97771, 40.75915
)


The Challenge

I would like to geotag all locations for which the
lon
and
lat
columns are currently
NA
. There are many ways I could go about this, one of which is shown below:

# Safe Code is Great Code
safe_geocode <- safely(geocode)

# Identify Data to be Geotagged by Absence of lon and lat
data_to_be_geotagged <- df %>% filter(is.na(lon) | is.na(lat))

# GeoTag Addresses of Missing Data Points
fullAddress <- paste(data_to_be_geotagged$Street,
data_to_be_geotagged$City,
data_to_be_geotagged$State,
data_to_be_geotagged$Zip,
sep = ", ")

fullAddress %>%
map(safe_geocode) %>%
map("result") %>%
plyr::ldply()


The Question

While I can get the above to work, and even wrangle the newly identified
lon
and
lat
coordinates back into the original data frame, the whole scheme feels dirty. I am convinced there is an elegant way to leverage piping and purrr to go through the data-frame and conditionally geotag the locations based on the absence of
lon
and
lat
.

I have been down a number of rabbit holes including
purrr::pmap
in an attempt to walk through multiple columns in parallel when constructing the full address (As well as
rowwise()
and
by_row()
). Nevertheless, I fall short in constructing anything that would qualify as an elegant solution.

Any insight provided would be most appreciated.

Answer

Really, you want to avoid calling geocode any more than necessary because it's slow and if you're using Google, you only have 2500 queries per day. Thus, it's best to make both columns from the same call, which can be done with a list column, making a new version of the data.frame with do, or a self-join.


1. With a list column

With a list column, you make a new version of lon and lat with ifelse, geocoding if there are NAs, else just copying the existing values. Afterwards, get rid of the old versions of the columns and unnest the new ones:

library(dplyr)
library(ggmap)
library(tidyr)    # For `unnest`

       # Evaluate each row separately
df %>% rowwise() %>% 
    # Add a list column. If lon or lat are NA,
    mutate(data = ifelse(any(is.na(c(lon, lat))), 
                         # return a data.frame of the geocoded results,
                         list(geocode(paste(Street, City, State, Zip))), 
                         # else return a data.frame of existing columns.
                         list(data_frame(lon = lon, lat = lat)))) %>% 
    # Remove old columns
    select(-lon, -lat) %>% 
    # Unnest newly created ones from list column
    unnest(data)

## # A tibble: 6 × 6
##                 Street       City    State   Zip       lon      lat
##                  <chr>      <chr>    <chr> <dbl>     <dbl>    <dbl>
## 1        226 W 46th St   New York New York 10036 -73.98670 40.75902
## 2              5th Ave   New York New York 10022 -73.97491 40.76167
## 3          75 Broadway   New York New York 10006 -74.01205 40.70814
## 4          350 5th Ave   New York New York 10118 -73.98566 40.74871
## 5  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 6 45 Rockefeller Plaza   New York New York 10111 -73.97771 40.75915

2. With do

do, on the other hand, creates a wholly new data.frame from pieces of the old one. It requires slightly clunky $ notation, with . to represent the grouped data.frame piped in. Using if and else instead of ifelse lets you avoid nesting results in lists (which they had to be above, anyway).

       # Evaluate each row separately
df %>% rowwise() %>% 
    # Make a new data.frame from the first four columns and the geocode results or existing lon/lat
    do(bind_cols(.[1:4], if(any(is.na(c(.$lon, .$lat)))){
        geocode(paste(.[1:4], collapse = ' '))
    } else {
        .[5:6]
    }))

which returns exactly the same thing as the first version.


3. On a subset, recombining with a self-join

If the ifelse is overly confusing, you can just geocode a subset and then recombine by binding the rows to the anti_join, i.e. all the rows that are in df but not the subset .:

df %>% filter(is.na(lon) | is.na(lat)) %>% 
    select(1:4) %>% 
    bind_cols(geocode(paste(.$Street, .$City, .$State, .$Zip))) %>% 
    bind_rows(anti_join(df, ., by = c('Street', 'Zip')))

which returns the same thing, but with the newly geocoded rows at the top. The same approach works with a list column or do, but since there's no need to combine two sets of columns, just bind_cols will do the trick.


4. On a subset with mutate_geocode

ggmap actually includes a mutate_geocode function that will add lon and lat columns when passed a data.frame and a column of addresses. It has two issues:

  • surprisingly, it can't handle tibbles, and
  • it can't accept more than a column name for the address, and thus requires a single column with the entire address.

Thus, while this version could be quite nice, it requires a lot of munging just to make mutate_geocode work properly:

df %>% filter(is.na(lon) | is.na(lat)) %>% 
    select(1:4) %>% 
    mutate(address = paste(Street, City, State, Zip)) %>%    # make an address column
    as.data.frame() %>%    # convert from tibble to data.frame
    mutate_geocode(address) %>% 
    select(-address) %>%    # get rid of address column
    bind_rows(anti_join(df, ., by = c('Street', 'Zip')))

##                 Street       City    State   Zip       lon      lat
## 1              5th Ave   New York New York 10022 -73.97491 40.76167
## 2  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 3 45 Rockefeller Plaza   New York New York 10111 -73.97771 40.75915
## 4          350 5th Ave   New York New York 10118 -73.98566 40.74871
## 5          75 Broadway   New York New York 10006 -74.01205 40.70814
## 6        226 W 46th St   New York New York 10036 -73.98670 40.75902

All versions only call geocode twice.

Note that while you could use purrr for the job, it's not particularly better suited than regular dplyr. purrr excels at dealing with lists, and while a list column is one option, it doesn't really have to be manipulated.