EdM - 7 months ago 28

R Question

Ok, here is the problem that I would love to solve using an efficient, elegant solution such as data.table or dplyr.

Define:

`DT = data.table(group=c(rep("A",3),rep("B",5)),value=c(2,9,2,3,4,1,0,3))`

time group value

1: 1 A 2

2: 2 A 9

3: 3 A 2

4: 1 B 3

5: 2 B 4

6: 3 B 1

7: 4 B 0

8: 5 B 3

What I'm trying to get is a cumulative sum by group of the values divided by their inverse ordering at the moment of time they've been observed.

`time group value RESULT`

1: 1 A 2 2.000000

2: 2 A 9 10.000000

3: 3 A 2 7.166667

4: 1 B 3 3.000000

5: 2 B 4 5.500000

6: 3 B 1 4.000000

7: 4 B 0 2.583333

8: 5 B 3 4.933333

At line 5 the result is:

`4/1 + 3/2 = 5.5`

Because at time 2, the group B had 2 observations, the last is divided by 1 and the previous by 1.

Next in line 6 the result is:

`1/1 + 4/2+ 3/3 = 4`

Since at time 3, the group B had 3 observations, the last is divided by 1, the previous by 2 and the still previous by 3. At line 7,

`0/1 + 1/2 + 4/3 + 3/4 = 2.583333`

The data is big, so avoiding loops is essential!

Answer

I'd use matrix algebra:

```
n_max = DT[, .N, by=group][, max(N)]
m = matrix(0, n_max, n_max)
m[] = ifelse( col(m) >= row(m), 1 / (col(m) - row(m) + 1 ), m)
DT[, res := value %*% m[seq_len(.N), seq_len(.N)], by=group ]
group value res
1: A 2 2.000000
2: A 9 10.000000
3: A 2 7.166667
4: B 3 3.000000
5: B 4 5.500000
6: B 1 4.000000
7: B 0 2.583333
8: B 3 4.933333
```