user3022875 user3022875 - 2 months ago 7
R Question

adding columns based on ordering and grouping

I have this data frame

dat = data.frame(ID= c(1,1,1,2,3,3),
NumberInSequence= c(1,2,3,1,1,2),
StartTime = as.POSIXct(c("2016-01-01 05:52:05 GMT","2016-01-01 05:52:11 GMT","2016-01-01 05:52:16 GMT","2016-01-01 05:40:05 GMT","2016-01-01 06:12:13 GMT","2016-01-01 07:12:26 GMT")) ,
EndTime = as.POSIXct(c("2016-01-01 05:52:10 GMT","2016-01-01 05:52:16 GMT","2016-01-01 05:52:30 GMT","2016-01-01 05:46:05 GMT","2016-01-01 06:12:25 GMT","2016-01-01 08:00:00 GMT") )
)


dat

ID NumberInSequence StartTime EndTime
1 1 1 2016-01-01 05:52:05 2016-01-01 05:52:10
2 1 2 2016-01-01 05:52:11 2016-01-01 05:52:16
3 1 3 2016-01-01 05:52:16 2016-01-01 05:52:30
4 2 1 2016-01-01 05:40:05 2016-01-01 05:46:05
5 3 1 2016-01-01 06:12:13 2016-01-01 06:12:25
6 3 2 2016-01-01 07:12:26 2016-01-01 08:00:00


Each ID can have 1 or more rows and each row is in time order for every ID. I'd like to add 2 columns:

1st column: "Duration" which the length of time, in seconds, BETWEEN the EndTime of an ID to the next start time of the same id.

2nd column: "Next start time" which is the actual next start time of the same ID.

So the results should look like this:

dat$Duration = ?

dat$NextStartTime = ?

ID NumberInSequence StartTime EndTime Duration NextStartTime
1 1 1 2016-01-01 05:52:05 2016-01-01 05:52:10 1 2016-01-01 05:52:11
2 1 2 2016-01-01 05:52:11 2016-01-01 05:52:16 0 2016-01-01 05:52:16
3 1 3 2016-01-01 05:52:16 2016-01-01 05:52:30 NA NA
4 2 1 2016-01-01 05:40:05 2016-01-01 05:46:05 NA NA
5 3 1 2016-01-01 06:12:13 2016-01-01 06:12:25 3601 2016-01-01 07:12:26
6 3 2 2016-01-01 07:12:26 2016-01-01 08:00:00 NA NA


For example for ID = 3 a NUMBER IN SEQUENCE = 1 the end time is 61 seconds later so the duration is 61 and the next start time is 07:12:26 from ID = 3 and number in sequence = 2.

For rows where there is there is no next start time NA should appear like ID = 2 and number in sequence = 1.

I was looking at doing this with dplyr somehow....

######## UPDATE

lead() is the answer but there is a minor issue. Please see this post

using dplyr lead but with some contraints

Answer

For NextStartTime, as long as your data is in order (use arrange(ID, NumberInSequence) if unsure), you can use dplyr::lead, which is like stats::lag on a time series with a negative lag.

For Duration, you can subtract times, but if the units can be on different units of magnitude, it's safer to use difftime directly so you can keep consistent units.

All together:

library(dplyr)

dat %>% group_by(ID) %>% 
    arrange(ID, NumberInSequence) %>%    # not necessary if already arranged, as here
    mutate(NextStartTime = lead(StartTime), 
           Duration = difftime(NextStartTime, EndTime, units = 's'))

## Source: local data frame [6 x 6]
## Groups: ID [3]
## 
##      ID NumberInSequence           StartTime             EndTime       NextStartTime  Duration
##   <dbl>            <dbl>              <dttm>              <dttm>              <dttm>    <time>
## 1     1                1 2016-01-01 05:52:05 2016-01-01 05:52:10 2016-01-01 05:52:11    1 secs
## 2     1                2 2016-01-01 05:52:11 2016-01-01 05:52:16 2016-01-01 05:52:16    0 secs
## 3     1                3 2016-01-01 05:52:16 2016-01-01 05:52:30                <NA>   NA secs
## 4     2                1 2016-01-01 05:40:05 2016-01-01 05:46:05                <NA>   NA secs
## 5     3                1 2016-01-01 06:12:13 2016-01-01 06:12:25 2016-01-01 07:12:26 3601 secs
## 6     3                2 2016-01-01 07:12:26 2016-01-01 08:00:00                <NA>   NA secs
Comments