I have a data set containing ID, Date, and event. Event is a binary outcome variable.
There can only ever be a max of one event per ID. The event may be followed by more 0's. I want to remove all zeroes that appear after the event by group. I have a solution using dplyr below but am interested to know if there is a better method. What is a good way of indicating that an event has occurred at some point in the past?
d <-as.Date("01-05-15", "%d-%m-%y")
df <- data.frame(ID= c(rep(234,4),rep(235,6), rep(237,5)),
date = as.Date(c((d-4):(d-1),(d-1):(d+4),(d+1):(d+5)),origin="1960-10-01"),
event = c(0,1,0,0,0,0,0,0,0,0,0,0,0,1,0))
#How can Improve this?
df %>% group_by(ID) %>%
mutate(cumulative = lag(cumsum(event), default = 0)) %>%
filter(cumulative <1) %>%
select(-cumulative) %>% ungroup
We can try with
dplyr. After grouping by 'ID', check whether
all the elements in 'event' is 0 (
all(event == 0)) or (
|) sequence of rows is less than or equal to the index of the first maximum element of 'event' (
row_number() <= which.max(event)) and use that logical index to
filter the rows.
library(dplyr) df %>% group_by(ID) %>% filter(row_number() <= which.max(event)|all(event==0)) # ID date event # <dbl> <date> <dbl> #1 234 2015-04-27 0 #2 234 2015-04-28 1 #3 235 2015-04-30 0 #4 235 2015-05-01 0 #5 235 2015-05-02 0 #6 235 2015-05-03 0 #7 235 2015-05-04 0 #8 235 2015-05-05 0 #9 237 2015-05-02 0 #10 237 2015-05-03 0 #11 237 2015-05-04 0 #12 237 2015-05-05 1
Or a bit more compact option would be a double cumulative sum on the
event that is equal to 1 and checking whether it is less than 2 in the
df %>% group_by(ID) %>% filter(cumsum(cumsum(event == 1))<2)
data.table, convert the 'data.frame' to 'data.table' (
setDT(df)), grouped by 'ID'
all the element of 'event' is 0, Subset the Data.table (
else Subset the Data.table with the rows including from the first to the first maximum element in 'event'.
library(data.table) setDT(df)[, if(all(event==0)) .SD else .SD[seq(which.max(event))], by = ID]