James White James White - 3 months ago 12
R Question

Using lapply to output values between date ranges within different factor levels

I have 2 dataframes, one representing daily sales figures of different stores (df1) and one representing when each store has been audited (df2). I need to create a new dataframe displaying sales information from each site taken 1 week before each audit (i.e. the information in df2). Some example data, firstly for the daily sales figures from different stores across a certain period:

Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Sales <- as.data.frame(matrix(sample(0:50, 30*10, replace=TRUE), ncol=3))
df1 <- cbind(Dates,Sales)
colnames(df1) <- c("Dates","Site.A","Site.B","Site.C")


And for the dates of each audit across different stores:

Store<- c("Store.A","Store.A","Store.B","Store.C","Store.C")
Audit_Dates <- as.data.frame(as.POSIXct(c("2016/1/4","2016/3/1","2016/2/1","2016/2/1","2016/3/1")))
df2 <- as.data.frame(cbind(Store,Audit_Dates ))
colnames(df2) <- c("Store","Audit_Dates")


Of note is that there will be an uneven amount of dates within each output (i.e. there may not be a full weeks worth of information prior to some store audits). I have previously asked a question addressing a similar problem Creating a dataframe from an lapply function with different numbers of rows. Below shows an answer from this which would work for an example if I was to consider information from only 1 store:

library(lubridate)
##Data input
Store.A_Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Store.A_Sales <- as.data.frame(matrix(sample(0:50, 10*10, replace=TRUE), ncol=1))
Store.A_df1 <- cbind(Store.A_Dates,Store.A_Sales)
colnames(Store.A_df1) <- c("Store.A_Dates","Store.A_Sales")
Store.A_df2 <- as.Date(c("2016/1/3","2016/3/1"))

##Output
Store.A_output<- lapply(Store.A_df2, function(x) {Store.A_df1[difftime(Store.A_df1[,1], x - days(7)) >= 0 & difftime(Store.A_df1[,1], x) <= 0, ]})
n1 <- max(sapply(Store.A_output, nrow))
output <- data.frame(lapply(Store.A_output, function(x) x[seq_len(n1),]))


But I don't know how I would get this for multiple sites.

Answer

Try this:

# Renamed vars for my convenience...
colnames(df1) <- c("t","Store.A","Store.B","Store.C")
colnames(df2) <- c("Store","t")

library(tidyr)
library(dplyr)

# Gather df1 so that df1 and df2 have the same format:

df1 = gather(df1, Store, Sales, -t)
head(df1)
           t   Store Sales
1 2015-12-30 Store.A    16
2 2015-12-31 Store.A    24
3 2016-01-01 Store.A     8
4 2016-01-02 Store.A    42
5 2016-01-03 Store.A     7
6 2016-01-04 Store.A    46

# This lapply call does not iterate over actual values, just indexes, which allows
# you to subset the data comfortably:

r <- lapply(1:nrow(df2), function(i) {
   audit.t = df2[i, "t"]                                     #time of audit
   audit.s = df1[, "Store"] == df2[i, "Store"]               #store audited
   df = df1[audit.s, ]                             #data from audited store

   week_before = difftime(df[, "t"], audit.t - (7*24*3600)) >= 0
   week_audit  = difftime(df[, "t"], audit.t) <= 0

   df[week_before & week_audit, ]
})

Does this give you the proper subsets?

Also, to summarise your results:

r = do.call("rbind", r) %>% 
  group_by(audited, Store) %>% 
  summarise(sales = sum(Sales))

r

     audited   Store sales
      <time>   <chr> <int>
1 2016-01-04 Store.A    97
2 2016-02-01 Store.B   156
3 2016-02-01 Store.C   226
4 2016-03-01 Store.A   115
5 2016-03-01 Store.C   187