Sairam Reddy Sairam Reddy - 1 month ago 8
R Question

Merge two data frames by a max number condition in r

Cheers, I have a data frame df1 with the Major City with max visitors in 2011.

df1:
Country City Visitors_2011
UK London 100000
USA Washington D.C 200000
USA New York 100000
France Paris 100000


The other data frame df2 consists of top visited cities in the country for 2012:

df2:
Country City Visitors_2012
USA Washington D.C 200000
USA New York 100000
USA Las Angeles 100000
UK London 100000
UK Manchester 100000
France Paris 100000
France Nice 100000


The Output I would need is:
Logic: To obtain df3, merge df1 and df2 by Country and City and if you can't find city in df1 then add that volume to biggest city in df1.
Example: Los Angeles visitor count here is added to Washington D.C because Los Angeles is not present in df1 and Washington D.C has more visitors(2012) than New York.

df3:
Country City Visitors_2011 Visitors_2012
UK London 100000 200000
USA Washington D.C 200000 300000
USA New York 100000 100000
France Paris 100000 200000


Can anyone point me to the right direction?

Answer

A dplyr approach:

library(dplyr)
max.cities <- df1 %>% group_by(Country) %>% summarise(City = City[which.max(Visitors_2011)])
result <- df2 %>% mutate(City=ifelse(City %in% df1$City, City,
                                     max.cities$City[match(Country, max.cities$Country)])) %>%
                  group_by(Country,City) %>%
                  summarise(Visitors_2012=sum(Visitors_2012)) %>% 
                  left_join(df1,., by=c("Country", "City"))

Notes:

  1. First, compute the City that has the max visitors group_by Country in df1 and set that to a separate data frame max.cities.
  2. mutate the City column in df2 so that if the City is in df1, then the name is unchanged; otherwise, the City from max.cites that matches the Country is used.
  3. Once the City has been suitably modified, group_by both Country and City and sum up the Visitors_2012.
  4. Finally, left_join with df1 by c("Country", "City") to get the final result.

The result using your posted data is as expected:

print(result)
##  Country           City Visitors_2011 Visitors_2012
##1      UK         London        100000        200000
##2     USA Washington D.C        200000        300000
##3     USA       New York        100000        100000
##4  France          Paris        100000        200000
Comments