Todd Young Todd Young - 10 months ago 42
R Question

Data.table: dynamically creating variables over various subsets of data and grouping by variable x, subsetting final results

I am creating three types of variables over multiple time periods for individual customer IDs in my data. These new variables include the sum of a price vector, the mean of a price vector, and the mean difference between successive dates in a date vector.

Using data.table, I am looping through multiple time periods, subsetting the data in each period, and calculating those variables for individual customer IDs. Each of these variables are named dynamically as I loop over the time periods. As it stands, these variables are being computed correctly.

Here is where I am getting stuck: after all of these variables are computed, I would like to subset the data to include the new aggregated variables along with the most recent purchase.price and date elements for each customer.

I thought that data.table might replicate the computed totals over all rows corresponding to each customer. However, it only replicates those totals in rows corresponding to the period intervals specified in the table's i index. Since it does not replicate those totals over all rows for each customer, my final dplyr block does not do the trick.

In the second and third code blocks I will give the output of the final dplyr code and then what the output I would like to achieve.

This question stems from a similar problem noted where we are subsetting over fewer variables that are not being created dynamically.


data <- data.frame(custid = c(rep(1, 25), rep(2, 25), rep(1, 25), rep(2, 25)),
purchase.price = seq(1, 200, by=2),
date = seq.Date(from=as.Date("2015-01-01"), to=as.Date("2015-04-10"), by="days"))

period_intervals <- list(period_one = interval(as.Date("2015-01-01"), as.Date("2015-01-30")),
period_two = interval(as.Date("2015-02-01"), as.Date("2015-02-28")),
period_three = interval(as.Date("2015-03-01"), as.Date("2015-03-31")),
period_four = interval(as.Date("2015-04-01"), as.Date("2015-04-28")))

data <-
data <- data[order(date)]
setkey(data, custid)

time_periods <- c(1:4)

for(i in time_periods[1]:max(time_periods)){
data <- data[date %within% period_intervals[[i]],
paste("period", i, "price.sum", sep="."):= sum(purchase.price),
by = custid]

data <- data[date %within% period_intervals[[i]],
paste("period", i, "price.mean", sep="."):= mean(purchase.price),
by = custid]

data <- data[date %within% period_intervals[[i]],
paste("period", i, "", sep="."):= mean(as.numeric(diff(purchase.price))),
by = custid]

data_sub <- data %>%
group_by(custid) %>%
arrange(desc(date)) %>%
filter(row_number() == 1)

Current result from dplyr subsetting (showing the first 7 columns):

custid purchase.price date period.1.price.sum period.1.price.mean period.2.price.sum ...
<dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> ...
1 2 199 2015-04-10 NA NA NA NA ...
2 1 149 2015-03-16 NA NA NA NA ...

Here is what I was hoping for (showing first 7 columns):

custid purchase.price date period.1.price.sum period.1.price.mean period.2.price.sum ...
<dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> ...
1 2 199 2015-04-10 625 25 2 981 ...
2 1 149 2015-03-16 275 55 2 1539 ...


In my complete dataset, I am looping over anywhere between 10-20 time periods. The number of periods to be computed over is subject to change, thus my approach to dynamically create the new variables.

Answer Source

We can use Map as in the previous post

nm1 <- sprintf("%s.%d.%s", "period", seq_along(period_intervals), "price.sum")
nm2 <- sprintf("%s.%d.%s", "period", seq_along(period_intervals), "price.mean")
nm3 <- sprintf("%s.%d.%s", "period", seq_along(period_intervals), "")
data[,   c(rbind(nm1, nm2, nm3)) := unlist(Map(function(x,y) {
         x1 <- purchase.price[x %within% y]
         list(sum(x1), mean(x1), mean(as.numeric(diff(x1))))},
                   list(date), period_intervals), recursive = FALSE), by = custid]
data[order(custid, -date)][,.SD[1] , custid]