Ben Ben - 17 days ago 6
R Question

Fast calculation of CDF / rolling join on multiple columns

I'm trying to measure the empirical cumulative distribution of some data in a multivariate setting. That is, given a dataset like

library(data.table) # v 1.9.7

set.seed(2016)
dt <- data.table(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
dt
x y z
1: -0.91474 2.07025 -1.7499
2: 1.00125 -1.80941 -1.3856
3: -0.05642 1.58499 0.8110
4: 0.29665 -1.16660 0.3757
5: -2.79147 -1.75526 1.2851
---
996: 0.63423 0.13597 -2.3710
997: 0.21415 1.03161 -1.5440
998: 1.15357 -1.63713 0.4191
999: 0.79205 -0.56119 0.6670
1000: 0.19502 -0.05297 -0.3288


I want to count the number of samples such that (x <= X, y <= Y, z <= Z) for some grid of (X, Y, Z) upper bounds like

bounds <- CJ(X=seq(-2, 2, by=.1), Y=seq(-2, 2, by=.1), Z=seq(-2, 2, by=.1))
bounds
X Y Z
1: -2 -2 -2.0
2: -2 -2 -1.9
3: -2 -2 -1.8
4: -2 -2 -1.7
5: -2 -2 -1.6
---
68917: 2 2 1.6
68918: 2 2 1.7
68919: 2 2 1.8
68920: 2 2 1.9
68921: 2 2 2.0


Now, I've figured out that I can elegantly do this (using non-equi joins)

dt[, Count := 1]
result <- dt[bounds, on=c("x<=X", "y<=Y", "z<=Z"), allow.cartesian=TRUE][, list(N.cum = sum(!is.na(Count))), keyby=list(X=x, Y=y, Z=z)]
result[, CDF := N.cum/nrow(dt)]
result
X Y Z N.cum CDF
1: -2 -2 -2.0 0 0.000
2: -2 -2 -1.9 0 0.000
3: -2 -2 -1.8 0 0.000
4: -2 -2 -1.7 0 0.000
5: -2 -2 -1.6 0 0.000
---
68917: 2 2 1.6 899 0.899
68918: 2 2 1.7 909 0.909
68919: 2 2 1.8 917 0.917
68920: 2 2 1.9 924 0.924
68921: 2 2 2.0 929 0.929


But this method is really inefficient and gets very slow as I start increasing the bin count. I think a multivariate version of
data.table
's rolling join functionality would do the trick, but that's not possible to my knowledge. Any suggestions to speed this up?

Ben Ben
Answer

Not ideal, but this solution is better than my previous

X <- data.table(X=seq(-2, 2, by=.1)); X[, x := X]
Y <- data.table(Y=seq(-2, 2, by=.1)); Y[, y := Y]
Z <- data.table(Z=seq(-2, 2, by=.1)); Z[, z := Z]

dt <- X[dt, on="x", roll=-Inf, nomatch=0]
dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
dt <- Z[dt, on="z", roll=-Inf, nomatch=0]
bg <- dt[, .N, keyby=list(X, Y, Z)]

bounds <- CJ(X=X$X, Y=Y$Y, Z=Z$Z)

result <- bg[bounds, on=c("X<=X", "Y<=Y", "Z<=Z"), allow.cartesian=TRUE][, list(N.cum = sum(N, na.rm=TRUE)), keyby=list(X, Y, Z)]
result[, CDF := N.cum/nrow(dt)]
result

Update

No time to explain right now, but this is the solution I was looking for

X <- data.table(X=seq(-2, 2, by=.1)); X[, x := X]
Y <- data.table(Y=seq(-2, 2, by=.1)); Y[, y := Y]
Z <- data.table(Z=seq(-2, 2, by=.1)); Z[, z := Z]

dt <- X[dt, on="x", roll=-Inf, nomatch=0]
dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
dt <- Z[dt, on="z", roll=-Inf, nomatch=0]
bg <- dt[, .N, keyby=list(X, Y, Z)]

bounds <- CJ(X=X$X, Y=Y$Y, Z=Z$Z)

kl <- bg[bounds, on=c("X", "Y", "Z")]
kl[is.na(N), N := 0]

# Counting
kl[, CountUntil.XgivenYZ := cumsum(N), by=list(Y, Z)]
kl[, CountUntil.XYgivenZ := cumsum(CountUntil.XgivenYZ), by=list(X, Z)]
kl[, CountUntil.XYZ := cumsum(CountUntil.XYgivenZ), by=list(X, Y)]

# Cleanup
setnames(kl, "CountUntil.XYZ", "N.cum")
kl[, CDF := N.cum/nrow(dt)]