Todd Young Todd Young - 1 year ago 36
R Question

Data.table: creating new variables, subsetting by date periods, grouping by second variable, x, and aggregating results by final date of each x

I am completely new to data.table, so please bear with me. In the following example, I would like to create two new columns in my dataset: the total purchase price within two different time periods for each customer ID. So far, I have that working.

However, I get stuck when trying to subset the data. I would like to subset the data so that each row represents a unique customer ID along with their final purchase price, final purchase date, period 1 purchase price total, and their period 2 purchase price total.

I thought that data.table might replicate the computed totals over all rows corresponding to each customer. However, it only replicates those totals in rows corresponding to the period intervals specified in the table's i index. Since it does not replicate those totals over all rows for each customer, my final dplyr block does not do the trick.

In the second and third code blocks I will give the output of the final dplyr code and then what the output I would like to achieve.


data <- data.frame(custid = c(rep(1, 4), rep(2, 4), rep(1, 4), rep(2, 4)),
purchase.price = seq(1, 32, by=2),
date = seq.Date(from=as.Date("2015-01-01"), to=as.Date("2015-01-16"), by="days"))

period_intervals <- list(period_one = interval(as.Date("2015-01-01"), as.Date("2015-01-09")),
period_two = interval(as.Date("2015-01-10"), as.Date("2015-01-16")))

data <-
data <- data[order(date)]
setkey(data, custid)

data <- data[date %within% period_intervals[[1]], := sum(purchase.price),
by = custid]

data <- data[date %within% period_intervals[[2]], := sum(purchase.price),
by = custid]

data_sub <- data %>%
group_by(custid) %>%
arrange(desc(date)) %>%
filter(row_number() == 1)

Current result:

custid purchase.price date
<dbl> <dbl> <date> <dbl> <dbl>
1 31 2015-01-16 NA 112
2 23 2015-01-12 NA 63

Finally, what I am aiming for:

custid purchase.price date
<dbl> <dbl> <date> <dbl> <dbl>
1 31 2015-01-16 33 112
2 23 2015-01-12 48 63

Answer Source

We can do this by subsetting the purchase.price with the index

data[, .( = sum(purchase.price[date %within% 
                     period_intervals[[1]]])),by = custid]

For creating the columns simultaneously, we can use Map

nm1 <- c('period.1. total.sum', '')
data[,  (nm1) := Map(function(x,y) sum(purchase.price[x %within% y]), 
                   list(date), period_intervals), by = custid] 
data[order(custid, -date)][,.SD[1:.N==1] , custid]
#   custid purchase.price       date period.1. total.sum
#1:      1             23 2015-01-12                  33                 63
#2:      2             31 2015-01-16                  48                112

NOTE: The 'total' column creation code was not showed in the OP's post.