user5843090 user5843090 - 5 months ago 9
MySQL Question

how to use r or sql to calculate the difference per group id?

I am looking for a way to calculate the time difference per group id. Here is part of my data:

ID road beginTime endTime Mon Tue Wed Thu Fri Sat
666 757 9:00 AM 11:45 AM S
555 758 1:55 PM 3:45 PM M W
555 759 10:40 AM 12:30 PM M W
555 760 4:00 PM 5:50 PM Tue R
444 761 3:00 PM 4:25 PM Tue R
444 762 4:30 PM 7:15 PM M
444 763 12:50 PM 2:40 PM Fri
444 764 10:40 AM 11:35 AM Tue R
222 765 11:45 AM 2:30 PM M W
222 766 6:00 PM 9:40 PM R
333 767 8:30 AM 11:15 AM M W
333 768 8:30 AM 11:15 AM Tue R
333 769 1:25 PM 2:50 PM Tue R
333 770 11:45 AM 1:10 PM M W


output from dput():

structure(list(ID = c(666L, 555L, 555L, 555L, 444L, 444L, 444L,
444L, 222L, 222L, 333L, 333L, 333L, 333L), road = 757:770, beginTime = structure(c(11L,
2L, 3L, 7L, 6L, 8L, 5L, 3L, 4L, 9L, 10L, 10L, 1L, 4L), .Label = c("1:25 PM",
"1:55 PM", "10:40 AM", "11:45 AM", "12:50 PM", "3:00 PM", "4:00 PM",
"4:30 PM", "6:00 PM", "8:30 AM", "9:00 AM"), class = "factor"),
endTime = structure(c(4L, 9L, 5L, 11L, 10L, 12L, 7L, 3L,
6L, 13L, 2L, 2L, 8L, 1L), .Label = c("1:10 PM", "11:15 AM",
"11:35 AM", "11:45 AM", "12:30 PM", "2:30 PM", "2:40 PM",
"2:50 PM", "3:45 PM", "4:25 PM", "5:50 PM", "7:15 PM", "9:40 PM"
), class = "factor"), Mon = structure(c(1L, 2L, 2L, 1L, 1L,
2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L), .Label = c("", "M"), class = "factor"),
Tue = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L,
1L, 2L, 2L, 1L), .Label = c("", "Tue"), class = "factor"),
Wed = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
2L, 1L, 1L, 2L), .Label = c("", "W"), class = "factor"),
Thu = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L,
1L, 2L, 2L, 1L), .Label = c("", "R"), class = "factor"),
Fri = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("", "Fri"), class = "factor"),
Sat = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("", "S"), class = "factor")), .Names = c("ID",
"road", "beginTime", "endTime", "Mon", "Tue", "Wed", "Thu", "Fri",
"Sat"), class = "data.frame", row.names = c(NA, -14L))


each ID drove on different roads (road) at different time (beginTime,endTime) of the day. I want calculate waiting (non-driving) time for each ID. For example, ID=555 drove on Monday and Wednesday. The first period is 10:40am - 12:30pm. And it waited for 1.41 hours and then started another period between 1:55 - 3:45. The waiting time of 1.41 hours is what I need. There is another waiting time when this id drove on Tuesday and Thursday. For ID=666, it only drove on Saturday for one period, so waiting time is 0. The difficulty for my data is that each ID has various periods each day. Any suggestions? Thanks very much!

Answer

Using the "long" format I mentioned in the comments makes things a little easier.

First, I'll clean up your data a little: convert factors to strings, then strings to times (df is your data as dputed above):

library(dplyr)
# small helper function
astime <- function(x) as.POSIXct(x, format = "%I:%M %p")
df2 <- df %>%
  mutate_each(funs(as.character), beginTime:Sat) %>%
  mutate_each(funs(astime), beginTime, endTime)
head(df2)
#    ID road           beginTime             endTime Mon Tue Wed Thu Fri Sat
# 1 666  757 2016-06-21 09:00:00 2016-06-21 11:45:00                       S
# 2 555  758 2016-06-21 13:55:00 2016-06-21 15:45:00   M       W            
# 3 555  759 2016-06-21 10:40:00 2016-06-21 12:30:00   M       W            
# 4 555  760 2016-06-21 16:00:00 2016-06-21 17:50:00     Tue       R        
# 5 444  761 2016-06-21 15:00:00 2016-06-21 16:25:00     Tue       R        
# 6 444  762 2016-06-21 16:30:00 2016-06-21 19:15:00   M                    

(Don't worry that the dates are all wrong, that should be ignored.) Now I'll convert from wide to long and remove those instances where the days are empty strings:

library(tidyr)
df3 <- df2 %>%
  gather(day, ign, Mon:Sat) %>%
  filter(ign != "") %>%
  select(-ign)
head(df3)
#    ID road           beginTime             endTime day
# 1 555  758 2016-06-21 13:55:00 2016-06-21 15:45:00 Mon
# 2 555  759 2016-06-21 10:40:00 2016-06-21 12:30:00 Mon
# 3 444  762 2016-06-21 16:30:00 2016-06-21 19:15:00 Mon
# 4 222  765 2016-06-21 11:45:00 2016-06-21 14:30:00 Mon
# 5 333  767 2016-06-21 08:30:00 2016-06-21 11:15:00 Mon
# 6 333  770 2016-06-21 11:45:00 2016-06-21 13:10:00 Mon

Now I'll group them and calculate the time waiting:

df4 <- df3 %>%
  arrange(ID, day, beginTime) %>%
  group_by(ID, day) %>%
  mutate(
    waitTime = difftime(beginTime, dplyr::lag(endTime, default = beginTime[1]), units='secs')
  )
head(df4)
# Source: local data frame [6 x 6]
# Groups: ID, day [5]
#      ID  road           beginTime             endTime   day       waitTime
#   <int> <int>              <time>              <time> <chr> <S3: difftime>
# 1   222   765 2016-06-21 11:45:00 2016-06-21 14:30:00   Mon         0 secs
# 2   222   766 2016-06-21 18:00:00 2016-06-21 21:40:00   Thu         0 secs
# 3   222   765 2016-06-21 11:45:00 2016-06-21 14:30:00   Wed         0 secs
# 4   333   767 2016-06-21 08:30:00 2016-06-21 11:15:00   Mon         0 secs
# 5   333   770 2016-06-21 11:45:00 2016-06-21 13:10:00   Mon      1800 secs
# 6   333   768 2016-06-21 08:30:00 2016-06-21 11:15:00   Thu         0 secs

You can easily filter for those times when somebody waited with:

df4 %>%
  filter(waitTime > 0)
# Source: local data frame [8 x 6]
# Groups: ID, day [8]
#      ID  road           beginTime             endTime   day       waitTime
#   <int> <int>              <time>              <time> <chr> <S3: difftime>
# 1   333   770 2016-06-21 11:45:00 2016-06-21 13:10:00   Mon      1800 secs
# 2   333   769 2016-06-21 13:25:00 2016-06-21 14:50:00   Thu      7800 secs
# 3   333   769 2016-06-21 13:25:00 2016-06-21 14:50:00   Tue      7800 secs
# 4   333   770 2016-06-21 11:45:00 2016-06-21 13:10:00   Wed      1800 secs
# 5   444   761 2016-06-21 15:00:00 2016-06-21 16:25:00   Thu     12300 secs
# 6   444   761 2016-06-21 15:00:00 2016-06-21 16:25:00   Tue     12300 secs
# 7   555   758 2016-06-21 13:55:00 2016-06-21 15:45:00   Mon      5100 secs
# 8   555   758 2016-06-21 13:55:00 2016-06-21 15:45:00   Wed      5100 secs

in which case you'll see that your example of ID 555 on Monday and Wednesday had a 1.41 hour (5100sec) break, and ID 666 had no wait time.