ds_user ds_user - 1 month ago 8
R Question

Finding first occurence of a value in every group R

I have this dataset about each machine in a lab,

MachineID InstalledDate SwitchedOnDate Status
1 2010-02-18 2010-02-19 SleepMode
1 2010-02-18 2010-02-20 Active
1 2010-02-18 2010-02-21 SleepMode
1 2010-02-18 2010-02-22 Active
2 2010-02-20 2010-02-21 Active
2 2010-02-20 2010-02-22 SleepMode
3 2010-02-10 2010-02-18 SleepMode
4 2010-03-10 2010-03-15 SleepMode


I am trying to find out how many days it took for each machine to go Active for the first time from Installed Date. So that is "SwitchedOnDate - InstalledDate".

Answer

Following @Gregor's and @Frank's comments, a better approach is to use distinct to retain only the (first) unique rows for each MachineID instead of grouping by MachineID:

library(dplyr)
res <- df %>% filter(Status=="Active") %>%
              distinct(MachineID, .keep_all=TRUE) %>%
              mutate(Days.Go.Active=difftime(SwitchedOnDate,InstalledDate,units="days"))
print(res)
##Source: local data frame [2 x 5]
##Groups: MachineID [2]
##
##  MachineID InstalledDate SwitchedOnDate Status Days.Go.Active
##      <int>        <date>         <date>  <chr> <S3: difftime>
##1         1    2010-02-18     2010-02-20 Active         2 days
##2         2    2010-02-20     2010-02-21 Active         1 days

With dplyr, you can mutate using difftime to compute the difference in units of "days":

library(dplyr)
res <- df %>% group_by(MachineID) %>% 
              filter(Status=="Active") %>%
              filter(row_number()==1) %>%
              mutate(Days.Go.Active=difftime(SwitchedOnDate,InstalledDate,units="days"))
print(res)
##Source: local data frame [2 x 5]
##Groups: MachineID [2]
##
##  MachineID InstalledDate SwitchedOnDate Status Days.Go.Active
##      <int>        <date>         <date>  <chr> <S3: difftime>
##1         1    2010-02-18     2010-02-20 Active         2 days
##2         2    2010-02-20     2010-02-21 Active         1 days

Here, we group_by MachineID and then use filter to keep only first row from each group that have Status of Active.


Data:

df <- structure(list(MachineID = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L), 
    InstalledDate = structure(c(14658, 14658, 14658, 14658, 14660, 
    14660, 14650, 14678), class = "Date"), SwitchedOnDate = structure(c(14659, 
    14660, 14661, 14662, 14661, 14662, 14658, 14683), class = "Date"), 
    Status = c("SleepMode", "Active", "SleepMode", "Active", 
    "Active", "SleepMode", "SleepMode", "SleepMode")), .Names = c("MachineID", 
"InstalledDate", "SwitchedOnDate", "Status"), row.names = c(NA, 
-8L), class = "data.frame")
##  MachineID InstalledDate SwitchedOnDate    Status
##1         1    2010-02-18     2010-02-19 SleepMode
##2         1    2010-02-18     2010-02-20    Active
##3         1    2010-02-18     2010-02-21 SleepMode
##4         1    2010-02-18     2010-02-22    Active
##5         2    2010-02-20     2010-02-21    Active
##6         2    2010-02-20     2010-02-22 SleepMode
##7         3    2010-02-10     2010-02-18 SleepMode
##8         4    2010-03-10     2010-03-15 SleepMode
Comments