LeoBB LeoBB - 1 month ago 8
R Question

Editing a dataframe to create a paired sample; removing records without a matching date in another group

I have done a bunch of searching for a solution to this and either can't find one or don't know it when I see it. I've seen some topics that are close to this but deal with matching between two different dataframes, whereas this is dealing with a single dataframe.

I have a dataframe with two groups (factors, col1) and a sampling date (date, col2), and then the measurement (numeric, col3). I would like to eventually run a statistical test on a paired sample between group A and B, so in order to create the paired sample, I want to only keep the records that have a measurement taken on the same day for both groups. In other words, remove the records in group A that do not have a corresponding measurement taken on the same day in group B, and vice versa. In the sample data below, that would result in rows 4 and 8 being removed. Another way of thinking of it is, how do I search for and remove records with only one occurrence of each date?

Sample data:

my.df <- data.frame(col1 = as.factor(c(rep("A", 4), rep("B", 4))),
col2 = as.Date(c("2001-01-01", "2001-01-02", "2001-01-03",
"2001-01-04", "2001-01-01", "2001-01-02", "2001-01-03",
"2001-02-03")),
col3 = sample(8))

Answer

Here are a few alternatives:

1) ave

> subset(my.df, ave(col3, col2, FUN = length) > 1)
  col1       col2 col3
1    A 2001-01-01    3
2    A 2001-01-02    2
3    A 2001-01-03    6
5    B 2001-01-01    7
6    B 2001-01-02    4
7    B 2001-01-03    1

2) split / Filter / do.call

> do.call("rbind", Filter(function(x) nrow(x) > 1, split(my.df, my.df$col2)))
             col1       col2 col3
2001-01-01.1    A 2001-01-01    3
2001-01-01.5    B 2001-01-01    7
2001-01-02.2    A 2001-01-02    2
2001-01-02.6    B 2001-01-02    4
2001-01-03.3    A 2001-01-03    6
2001-01-03.7    B 2001-01-03    1

3) dplyr (2) translates nearly directly into a dplyr solution:

> library(dplyr)
> my.df %>% group_by(col2) %>% filter(n() > 1)
Source: local data frame [6 x 3]
Groups: col2

  col1       col2 col3
1    A 2001-01-01    5
2    A 2001-01-02    1
3    A 2001-01-03    7
4    B 2001-01-01    2
5    B 2001-01-02    4
6    B 2001-01-03    6

4) data.table The last two solutions can also be translated to data.table

> data.table(my.df)[, if (.N > 1) .SD, by = col2]
         col2 col1 col3
1: 2001-01-01    A    5
2: 2001-01-01    B    2
3: 2001-01-02    A    1
4: 2001-01-02    B    4
5: 2001-01-03    A    7
6: 2001-01-03    B    6

5) tapply

> na.omit(tapply(my.df$col3, my.df[c('col2', 'col1')], identity))
            col1
col2         A B
  2001-01-01 3 7
  2001-01-02 2 4
  2001-01-03 6 1
attr(,"na.action")
2001-02-03 2001-01-04 
         5          4 

6) merge

> merge(subset(my.df, col1 == 'A'), subset(my.df, col1 == 'B'), by = 2)
        col2 col1.x col3.x col1.y col3.y
1 2001-01-01      A      3      B      7
2 2001-01-02      A      2      B      4
3 2001-01-03      A      6      B      1

7) sqldf (6) is similar to the following sqldf solution:

> sqldf("select * from `my.df` A join `my.df` B 
+    on A.col2 = B.col2 and A.col1 = 'A' and B.col1 = 'B'")
  col1       col2 col3 col1       col2 col3
1    A 2001-01-01    5    B 2001-01-01    2
2    A 2001-01-02    1    B 2001-01-02    4
3    A 2001-01-03    7    B 2001-01-03    6
Comments