John John - 1 month ago 13
R Question

R How to get desired rows in aggregation using data.table and dplyr

Suppose we have a data set like:

X = data.frame(
ID = 1:9,
DRIVE_NUM = c("A","A","A","B","B","B","C","C","C"),
FLAG =c("PASS","FAIL","PASS","PASS","PASS","PASS","PASS","FAIL","FAIL")
)

ID DRIVE_NUM FLAG
1 A PASS
2 A FAIL
3 A PASS
-----------------
4 B PASS
5 B PASS
6 B PASS
-----------------
7 C PASS
8 C FAIL
9 C FAIL


I want to aggregate this data set by DRIVE_NUM by the following rule:


For a specific DRIVE_NUM group,

If there is any FAIL flag in the DRIVE_NUM group, I want the first row
with the FAIL flag.

If there is no FAIL flag in the group, just take the first row in the
group.


So, I shall get the following set:

wanted = data.frame(
ID = c(2,4,8),
DRIVE_NUM = c("A","B","C"),
FLAG = c("FAIL","PASS","FAIL")
)

ID DRIVE_NUM FLAG
2 A FAIL
4 B PASS
8 C FAIL


Now I could do it using ddply but it's extremely slow since my data set is usually very big.

Is there any way to do it using data.table or dplyr.




Update:

It seems that dplyr is even slower than plyr. Is there any way to do anything faster than plyr. Or am I using anything inappropirately?

#Simulate Data

X = data.frame(
group = rep(paste0("NO",1:10000),each=2),
flag = sample(c("F","P"),20000,replace = TRUE),
var = rnorm(20000)
)



library(plyr)
library(dplyr)

#plyr

START = proc.time()
X2 = ddply(X,.(flag),function(df) {
if( sum(df$flag=="F")> 0){
R = df[df$flag=="F",]
if(nrow(R)>1) {R = R[1,]} else {R = R}
} else{
R = df[1,]
}
R
})
proc.time() - START

#user system elapsed
#0.03 0.00 0.03

#dplyr method 1

START = proc.time()
X %>%
group_by(group) %>%
slice(which.min(flag))
proc.time() - START

#user system elapsed
#0.22 0.02 0.23

#dplyr method 2

START = proc.time()
X %>%
group_by(group, flag) %>%
slice(1) %>%
group_by(group) %>%
slice(which.min(flag))
proc.time() - START

#user system elapsed
#0.28 0.00 0.28

Answer

Since we only ever want the first row for each DRIVE_NUM:FLAG combination, we start by selecting only those first rows using group_by and slice. Then we make sure we arrange by FLAG, so if there is a FAIL it will be the top row for that DRIVE_NUM, but if there is only a PASS, then nothing changes. Then we slice again to select that top row.

library(dplyr)
X %>%
  group_by(DRIVE_NUM, FLAG) %>%
  slice(1) %>%
  group_by(DRIVE_NUM) %>% 
  slice(which.min(FLAG))

@Frank basically solved this much shorter, relying on which.min taking the first value (first row).

X %>%
  group_by(DRIVE_NUM) %>% 
  slice(which.min(FLAG))
Comments