Rickard - 1 year ago 68
R Question

# Summarise transactions fulfilling a criterion in a sliding window

We have a table of transactions

``````set.seed(1)
X <- data.table(id = 1:10,
time = c(1,2,5,6,9,12,14,20,21,23),
val = sample(0.1*10^(1:4), 10, replace=TRUE),
code = sample(c('A','A','C','B'), 10, replace=TRUE)
)

id time  val code
1:  1    1   10    A
2:  2    2   10    A
3:  3    5  100    C
4:  4    6 1000    A
5:  5    9    1    B
6:  6   12 1000    A
7:  7   14 1000    C
8:  8   20  100    B
9:  9   21  100    A
10: 10   23    1    B
``````

For each row I want to count occurrences of
`code == 'A'`
along with the sum of
`val`
for previous rows where
`previous_row\$time >= current_row\$time - 3`

I.e. the expected result should be

``````    id time  val code count_A_within_3 sum_a_within_3
1:  1    1   10    A                1             10
2:  2    2   10    A                2             20
3:  3    5  100    C                1             10
4:  4    6 1000    A                1           1000
5:  5    9    1    B                1           1000
6:  6   12 1000    A                1           1000
7:  7   14 1000    C                1           1000
8:  8   20  100    B                0              0
9:  9   21  100    A                1            100
10: 10   23    1    B                1            100
``````

Is this possible to calculate efficiently using
`data.table`
or
`dplyr`
?

The real dataset contains ~1M groups where this operation should be performed within each group. Number of rows in each group range from 1 to 1000. An imperative solution (for loop with nested ifs and state-variables) is doable but very slow.

Using non-equi joins from the latest devel version (1.9.7+):

``````X[, prev.time := time - 3]

X[, c("count_A_within_3", "sum_a_within_3") :=
X[X, on = .(time >= prev.time, time <= time),
.(sum(code == "A"), sum(val[code == "A"])), by = .EACHI][, .(V1, V2)]]
X
#    id time  val code prev.time count_A_within_3 sum_a_within_3
# 1:  1    1   10    A        -2                1             10
# 2:  2    2   10    A        -1                2             20
# 3:  3    5  100    C         2                1             10
# 4:  4    6 1000    A         3                1           1000
# 5:  5    9    1    B         6                1           1000
# 6:  6   12 1000    A         9                1           1000
# 7:  7   14 1000    C        11                1           1000
# 8:  8   20  100    B        17                0              0
# 9:  9   21  100    A        18                1            100
#10: 10   23    1    B        20                1            100
``````

You may want to replace the inner two `X`'s with `.SD` for more robust code; I only left it as `X` for clarity/ease of understanding how it works.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download