Noobie Noobie - 2 months ago 6
R Question

R: how to resample intraday data at the group level?

Consider the following dataframe

time <-c('2016-04-13 23:07:45','2016-04-13 23:07:50','2016-04-13 23:08:45','2016-04-13 23:08:45'
,'2016-04-13 23:08:45','2016-04-13 23:07:50','2016-04-13 23:07:51')
group <-c('A','A','A','B','B','B','B')
value<- c(5,10,2,2,NA,1,4)
df<-data.frame(time,group,value)

> df
time group value
1 2016-04-13 23:07:45 A 5
2 2016-04-13 23:07:50 A 10
3 2016-04-13 23:08:45 A 2
4 2016-04-13 23:08:45 B 2
5 2016-04-13 23:08:45 B NA
6 2016-04-13 23:07:50 B 1
7 2016-04-13 23:07:51 B 4


I would like to resample this dataframe at the 5-second level within each group, and compute the sum of value for each time-interval / group pair.

The interval should be closed on the left and open on the right. For instance, the first line of output should be

2016-04-13 23:07:45 A 5
because the first 5-sec interval is
[2016-04-13 23:07:45, 2016-04-13 23:07:50[


How can I do that in either dplyr or data.table? Do I need to import lubridate for the timestamps?

Answer

How about this:

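library(dplyr)      # group_by, summarise, %>%
library(lubridate)  # ymd_hms, floor_date

Group5 <- function(myDf) {
    # Parse the timestamps, then round each one down to the start of its 5-second bin
    myDf$time <- ymd_hms(myDf$time)
    myDf$timeGroup <- floor_date(myDf$time, unit = "5 seconds")
    # Sum value within each (group, timeGroup) pair, ignoring NAs
    summarise(myDf %>% group_by(group, timeGroup), sum(value, na.rm = TRUE))
}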

Group5(df)
Source: local data frame [5 x 3]
Groups: group [?]

   group           timeGroup `sum(value, na.rm = TRUE)`
  <fctr>              <dttm>                      <dbl>
1      A 2016-04-13 23:07:45                          5
2      A 2016-04-13 23:07:50                         10
3      A 2016-04-13 23:08:45                          2
4      B 2016-04-13 23:07:50                          5
5      B 2016-04-13 23:08:45                          2

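It uses ymd_hms from lubridate to parse the timestamps and floor_date to round each one down to the start of its 5-second interval, so every interval is closed on the left and open on the right. With na.rm = TRUE, the NA in group B is simply ignored in the sum.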
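On the question of whether lubridate is strictly needed: the same left-closed binning can be sketched in base R by flooring the epoch seconds. This is only an illustration, not the method used above. The helper name bin_5s is hypothetical, and the timestamps are assumed to be in UTC.

```r
# bin_5s is a hypothetical helper, not part of any package.
# It rounds each timestamp down to a multiple of `width` seconds,
# which yields intervals of the form [start, start + width).
bin_5s <- function(x, width = 5) {
  secs <- as.numeric(as.POSIXct(x, tz = "UTC"))  # assumption: UTC timestamps
  as.POSIXct(floor(secs / width) * width, origin = "1970-01-01", tz = "UTC")
}

stamps <- c("2016-04-13 23:07:45", "2016-04-13 23:07:49",
            "2016-04-13 23:07:50", "2016-04-13 23:07:51")
binned <- bin_5s(stamps)
format(binned, "%H:%M:%S")
# "23:07:45" "23:07:45" "23:07:50" "23:07:50"
```

Note that 23:07:50 lands in the bin that starts at 23:07:50, not in the previous one, which is the closed-left, open-right behaviour asked for.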

Here is a larger example with 100,000 rows, 20 groups, and 5% missing values:

set.seed(500)
time <- ymd_hms('2016-04-13 23:07:45') + sample(-10^3:10^3, 10^5, replace=TRUE)
group <- rep(LETTERS[1:20], each = 5000)
value <- rep(NA, 10^5)
value[sample(10^5, 95000)] <- sample(100, 95000, replace=TRUE)
df2 <- data.frame(time,group,value)

head(df2)
                 time group value
1 2016-04-13 23:18:53     A    53
2 2016-04-13 23:15:15     A    NA
3 2016-04-13 23:23:36     A    40
4 2016-04-13 23:06:40     A    23
5 2016-04-13 23:18:10     A    74
6 2016-04-13 22:57:56     A    65

Calling Group5 on this data gives:

Group5(df2)
Source: local data frame [8,020 x 3]
Groups: group [?]

    group           timeGroup `sum(value, na.rm = TRUE)`
   <fctr>              <dttm>                      <int>
1       A 2016-04-13 22:51:05                        379
2       A 2016-04-13 22:51:10                        646
3       A 2016-04-13 22:51:15                        391
4       A 2016-04-13 22:51:20                       1118
5       A 2016-04-13 22:51:25                        745
6       A 2016-04-13 22:51:30                        546
7       A 2016-04-13 22:51:35                        884
8       A 2016-04-13 22:51:40                        711
9       A 2016-04-13 22:51:45                        526
10      A 2016-04-13 22:51:50                        484
# ... with 8,010 more rows
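Since the question also mentions data.table, here is a rough sketch of how the same aggregation might look there. It is not part of the original answer: it swaps floor_date for plain epoch arithmetic, assumes UTC timestamps, and the column name total is only illustrative.

```r
library(data.table)

# Same toy data as in the question
dt <- data.table(
  time  = c("2016-04-13 23:07:45", "2016-04-13 23:07:50", "2016-04-13 23:08:45",
            "2016-04-13 23:08:45", "2016-04-13 23:08:45", "2016-04-13 23:07:50",
            "2016-04-13 23:07:51"),
  group = c("A", "A", "A", "B", "B", "B", "B"),
  value = c(5, 10, 2, 2, NA, 1, 4)
)

# Parse the timestamps (UTC is an assumption), then floor to 5-second bins
dt[, time := as.POSIXct(time, tz = "UTC")]
dt[, timeGroup := as.POSIXct(floor(as.numeric(time) / 5) * 5,
                             origin = "1970-01-01", tz = "UTC")]

# Sum value per (group, timeGroup); keyby also sorts the result by those columns
res <- dt[, .(total = sum(value, na.rm = TRUE)), keyby = .(group, timeGroup)]
res
```

On the toy data this should produce the same five group/interval rows as Group5(df), with totals 5, 10, 2, 5 and 2.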