plperez plperez - 11 months ago 56
R Question

Combine dataframes with common column names but different values in columns

I think I ran into a situation that (to the best of my knowledge) is not fully covered by the awesome

library, so I guess it will need a bit more of coding than what I am capable of. I have the following 2 data frames:

df1 =
Col1 Col2 Col3
1 A 1 X
2 A 1 X
3 B 1 X
4 C 1 X
5 D 1 Y
6 D 1 Z

df2 =

Col1 Col2 Col3
1 A 2 X
2 B 2 Y
3 C 2 Y
4 G 2 Z
5 H 2 X
6 I 2 Z

I want only the rows which has common elements from
only, this is:

out =
Col1 Col2 Col3
1 A 1 X
2 A 1 X
3 B 1 X
4 C 1 X
5 A 2 X
6 B 2 Y
7 C 2 Y

It looks like
would do it, but since
have different values, it gives me table with 0 values. Your guidance is much appreciated. Thanks. P. Perez.

Answer Source

With base R you could do:

common <- intersect(df1$Col1, df2$Col1)

df3 <- rbind(df1, df2)
df3[df3$Col1 %in% common, ]

which gives:

   Col1 Col2 Col3
1     A    1    X
2     A    1    X
3     B    1    X
4     C    1    X
11    A    2    X
21    B    2    Y
31    C    2    Y

And with dplyr:

bind_rows(df1, df2) %>% 
  filter(Col1 %in% intersect(df1$Col1, df2$Col1))

which will give you the same output. An alternative by @Frank from the comments:

bind_rows(df1, df2, .id = "id") %>% 
  group_by(Col1) %>% 
  filter(n_distinct(id) == 2L)

The logic behind this is that you bind the two dataframes together and include an id-column simultaniously with the .id-parameter. Then group by the values of Col1 and check how much unique id's there are for each value. The ones with only one unique id, don't appear in both dataframes.

A similar logic can be applied with the data.table package:

rbindlist(list(df1, df2), idcol = 'id')[, if (uniqueN(id) == 2L) .SD, by = Col1]