Sairam Reddy - 1 year ago 63
R Question

Merge two data frames by a max number condition in r

Cheers, I have a data frame df1 with the Major City with max visitors in 2011.

``````df1:
Country      City             Visitors_2011
UK           London           100000
USA          Washington D.C   200000
USA          New York         100000
France       Paris            100000
``````

The other data frame df2 consists of top visited cities in the country for 2012:

``````df2:
Country      City               Visitors_2012
USA          Washington D.C     200000
USA          New York           100000
USA          Las Angeles        100000
UK           London             100000
UK           Manchester         100000
France       Paris              100000
France       Nice               100000
``````

The Output I would need is:
Logic: To obtain df3, merge df1 and df2 by Country and City and if you can't find city in df1 then add that volume to biggest city in df1.
Example: Los Angeles visitor count here is added to Washington D.C because Los Angeles is not present in df1 and Washington D.C has more visitors(2012) than New York.

``````df3:
Country      City             Visitors_2011    Visitors_2012
UK           London           100000           200000
USA          Washington D.C   200000           300000
USA          New York         100000           100000
France       Paris            100000           200000
``````

Can anyone point me to the right direction?

Answer Source

A `dplyr` approach:

``````library(dplyr)
max.cities <- df1 %>% group_by(Country) %>% summarise(City = City[which.max(Visitors_2011)])
result <- df2 %>% mutate(City=ifelse(City %in% df1\$City, City,
max.cities\$City[match(Country, max.cities\$Country)])) %>%
group_by(Country,City) %>%
summarise(Visitors_2012=sum(Visitors_2012)) %>%
left_join(df1,., by=c("Country", "City"))
``````

Notes:

1. First, compute the `City` that has the `max` visitors `group_by` `Country` in `df1` and set that to a separate data frame `max.cities`.
2. `mutate` the `City` column in `df2` so that if the `City` is in `df1`, then the name is unchanged; otherwise, the `City` from `max.cites` that `match`es the `Country` is used.
3. Once the `City` has been suitably modified, `group_by` both `Country` and `City` and sum up the `Visitors_2012`.
4. Finally, `left_join` with `df1` by `c("Country", "City")` to get the final `result`.

The result using your posted data is as expected:

``````print(result)
##  Country           City Visitors_2011 Visitors_2012
##1      UK         London        100000        200000
##2     USA Washington D.C        200000        300000
##3     USA       New York        100000        100000
##4  France          Paris        100000        200000
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download