Todd Young - 1 year ago 41

R Question

I am completely new to data.table, so please bear with me. In the following example, I would like to create two new columns in my dataset: the total purchase price within two different time periods for each customer ID. So far, I have that working.

However, I get stuck when trying to subset the data. I would like to subset the data so that each row represents a unique customer ID along with their final purchase price, final purchase date, period 1 purchase price total, and their period 2 purchase price total.

I thought that data.table might replicate the computed totals over all rows corresponding to each customer. However, it only replicates those totals in rows corresponding to the period intervals specified in the table's *i* index. Since it does not replicate those totals over all rows for each customer, my final dplyr block does not do the trick.

In the second and third code blocks I will give the output of the final dplyr code and then what the output I would like to achieve.

`library(lubridate)`

library(data.table)

library(dplyr)

data <- data.frame(custid = c(rep(1, 4), rep(2, 4), rep(1, 4), rep(2, 4)),

purchase.price = seq(1, 32, by=2),

date = seq.Date(from=as.Date("2015-01-01"), to=as.Date("2015-01-16"), by="days"))

period_intervals <- list(period_one = interval(as.Date("2015-01-01"), as.Date("2015-01-09")),

period_two = interval(as.Date("2015-01-10"), as.Date("2015-01-16")))

data <- as.data.table(data)

data <- data[order(date)]

setkey(data, custid)

data <- data[date %within% period_intervals[[1]],

period.1.price.total := sum(purchase.price),

by = custid]

data <- data[date %within% period_intervals[[2]],

period.2.price.total := sum(purchase.price),

by = custid]

data_sub <- data %>%

group_by(custid) %>%

arrange(desc(date)) %>%

filter(row_number() == 1)

Current result:

`custid purchase.price date period.1.price.total period.2.price.total`

<dbl> <dbl> <date> <dbl> <dbl>

1 31 2015-01-16 NA 112

2 23 2015-01-12 NA 63

Finally, what I am aiming for:

`custid purchase.price date period.1.price.total period.2.price.total`

<dbl> <dbl> <date> <dbl> <dbl>

1 31 2015-01-16 33 112

2 23 2015-01-12 48 63

Answer Source

We can do this by subsetting the purchase.price with the index

```
data[, .(period.1.total.sum = sum(purchase.price[date %within%
period_intervals[[1]]])),by = custid]
```

For creating the columns simultaneously, we can use `Map`

```
nm1 <- c('period.1. total.sum', 'period.2.total.sum')
data[, (nm1) := Map(function(x,y) sum(purchase.price[x %within% y]),
list(date), period_intervals), by = custid]
data[order(custid, -date)][,.SD[1:.N==1] , custid]
# custid purchase.price date period.1. total.sum period.2.total.sum
#1: 1 23 2015-01-12 33 63
#2: 2 31 2015-01-16 48 112
```

NOTE: The 'total' column creation code was not showed in the OP's post.