Jonno Bourne Jonno Bourne - 2 months ago 13
R Question

Removing rows from grouped data after an event has occurred

I have a data set containing ID, Date, and event. Event is a binary outcome variable.
There can only ever be a max of one event per ID. The event may be followed by more 0's. I want to remove all zeroes that appear after the event by group. I have a solution using dplyr below but am interested to know if there is a better method. What is a good way of indicating that an event has occurred at some point in the past?

library(dplyr)
d <-as.Date("01-05-15", "%d-%m-%y")
#Starting dataset
df <- data.frame(ID= c(rep(234,4),rep(235,6), rep(237,5)),
date = as.Date(c((d-4):(d-1),(d-1):(d+4),(d+1):(d+5)),origin="1960-10-01"),
event = c(0,1,0,0,0,0,0,0,0,0,0,0,0,1,0))

#desired result
df[c(1:2,5:14),]

#How can Improve this?
df %>% group_by(ID) %>%
mutate(cumulative = lag(cumsum(event), default = 0)) %>%
filter(cumulative <1) %>%
select(-cumulative) %>% ungroup

Answer

We can try with dplyr. After grouping by 'ID', check whether all the elements in 'event' is 0 (all(event == 0)) or (|) sequence of rows is less than or equal to the index of the first maximum element of 'event' (row_number() <= which.max(event)) and use that logical index to filter the rows.

library(dplyr)
df %>%
   group_by(ID) %>% 
   filter(row_number() <= which.max(event)|all(event==0))
#       ID       date event
#   <dbl>     <date> <dbl>
#1    234 2015-04-27     0
#2    234 2015-04-28     1
#3    235 2015-04-30     0
#4    235 2015-05-01     0
#5    235 2015-05-02     0
#6    235 2015-05-03     0
#7    235 2015-05-04     0
#8    235 2015-05-05     0
#9    237 2015-05-02     0
#10   237 2015-05-03     0
#11   237 2015-05-04     0
#12   237 2015-05-05     1

Or a bit more compact option would be a double cumulative sum on the event that is equal to 1 and checking whether it is less than 2 in the filter call.

df %>%
   group_by(ID) %>% 
   filter(cumsum(cumsum(event == 1))<2)

Or using data.table, convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'ID' if all the element of 'event' is 0, Subset the Data.table (.SD) or else Subset the Data.table with the rows including from the first to the first maximum element in 'event'.

library(data.table)
setDT(df)[, if(all(event==0)) .SD else .SD[seq(which.max(event))], by = ID]