Rickard Rickard - 2 months ago 11
R Question

Summarise transactions fulfilling a criterion in a sliding window

We have a table of transactions

set.seed(1)
X <- data.table(id = 1:10,
time = c(1,2,5,6,9,12,14,20,21,23),
val = sample(0.1*10^(1:4), 10, replace=TRUE),
code = sample(c('A','A','C','B'), 10, replace=TRUE)
)


id time val code
1: 1 1 10 A
2: 2 2 10 A
3: 3 5 100 C
4: 4 6 1000 A
5: 5 9 1 B
6: 6 12 1000 A
7: 7 14 1000 C
8: 8 20 100 B
9: 9 21 100 A
10: 10 23 1 B


For each row I want to count occurrences of
code == 'A'
along with the sum of
val
for previous rows where
previous_row$time >= current_row$time - 3

I.e. the expected result should be

id time val code count_A_within_3 sum_a_within_3
1: 1 1 10 A 1 10
2: 2 2 10 A 2 20
3: 3 5 100 C 1 10
4: 4 6 1000 A 1 1000
5: 5 9 1 B 1 1000
6: 6 12 1000 A 1 1000
7: 7 14 1000 C 1 1000
8: 8 20 100 B 0 0
9: 9 21 100 A 1 100
10: 10 23 1 B 1 100


Is this possible to calculate efficiently using
data.table
or
dplyr
?

The real dataset contains ~1M groups where this operation should be performed within each group. Number of rows in each group range from 1 to 1000. An imperative solution (for loop with nested ifs and state-variables) is doable but very slow.

Answer

Using non-equi joins from the latest devel version (1.9.7+):

X[, prev.time := time - 3]

X[, c("count_A_within_3", "sum_a_within_3") :=
      X[X, on = .(time >= prev.time, time <= time),
        .(sum(code == "A"), sum(val[code == "A"])), by = .EACHI][, .(V1, V2)]]
X
#    id time  val code prev.time count_A_within_3 sum_a_within_3
# 1:  1    1   10    A        -2                1             10
# 2:  2    2   10    A        -1                2             20
# 3:  3    5  100    C         2                1             10
# 4:  4    6 1000    A         3                1           1000
# 5:  5    9    1    B         6                1           1000
# 6:  6   12 1000    A         9                1           1000
# 7:  7   14 1000    C        11                1           1000
# 8:  8   20  100    B        17                0              0
# 9:  9   21  100    A        18                1            100
#10: 10   23    1    B        20                1            100

You may want to replace the inner two X's with .SD for more robust code; I only left it as X for clarity/ease of understanding how it works.

Comments