Ben - 1 year ago 115
R Question

# Fast calculation of CDF / rolling join on multiple columns

I'm trying to measure the empirical cumulative distribution of some data in a multivariate setting. That is, given a dataset like

``````library(data.table)  # v 1.9.7

set.seed(2016)
dt <- data.table(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))
dt
x        y       z
1: -0.91474  2.07025 -1.7499
2:  1.00125 -1.80941 -1.3856
3: -0.05642  1.58499  0.8110
4:  0.29665 -1.16660  0.3757
5: -2.79147 -1.75526  1.2851
---
996:  0.63423  0.13597 -2.3710
997:  0.21415  1.03161 -1.5440
998:  1.15357 -1.63713  0.4191
999:  0.79205 -0.56119  0.6670
1000:  0.19502 -0.05297 -0.3288
``````

I want to count the number of samples such that (x <= X, y <= Y, z <= Z) for some grid of (X, Y, Z) upper bounds like

``````bounds <- CJ(X=seq(-2, 2, by=.1), Y=seq(-2, 2, by=.1), Z=seq(-2, 2, by=.1))
bounds
X  Y    Z
1: -2 -2 -2.0
2: -2 -2 -1.9
3: -2 -2 -1.8
4: -2 -2 -1.7
5: -2 -2 -1.6
---
68917:  2  2  1.6
68918:  2  2  1.7
68919:  2  2  1.8
68920:  2  2  1.9
68921:  2  2  2.0
``````

Now, I've figured out that I can elegantly do this (using non-equi joins)

``````dt[, Count := 1]
result <- dt[bounds, on=c("x<=X", "y<=Y", "z<=Z"), allow.cartesian=TRUE][, list(N.cum = sum(!is.na(Count))), keyby=list(X=x, Y=y, Z=z)]
result[, CDF := N.cum/nrow(dt)]
result
X  Y    Z N.cum   CDF
1: -2 -2 -2.0     0 0.000
2: -2 -2 -1.9     0 0.000
3: -2 -2 -1.8     0 0.000
4: -2 -2 -1.7     0 0.000
5: -2 -2 -1.6     0 0.000
---
68917:  2  2  1.6   899 0.899
68918:  2  2  1.7   909 0.909
68919:  2  2  1.8   917 0.917
68920:  2  2  1.9   924 0.924
68921:  2  2  2.0   929 0.929
``````

But this method is really inefficient and gets very slow as I start increasing the bin count. I think a multivariate version of
`data.table`
's rolling join functionality would do the trick, but that's not possible to my knowledge. Any suggestions to speed this up?

Not ideal, but this solution is better than my previous

``````X <- data.table(X=seq(-2, 2, by=.1)); X[, x := X]
Y <- data.table(Y=seq(-2, 2, by=.1)); Y[, y := Y]
Z <- data.table(Z=seq(-2, 2, by=.1)); Z[, z := Z]

dt <- X[dt, on="x", roll=-Inf, nomatch=0]
dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
dt <- Z[dt, on="z", roll=-Inf, nomatch=0]
bg <- dt[, .N, keyby=list(X, Y, Z)]

bounds <- CJ(X=X\$X, Y=Y\$Y, Z=Z\$Z)

result <- bg[bounds, on=c("X<=X", "Y<=Y", "Z<=Z"), allow.cartesian=TRUE][, list(N.cum = sum(N, na.rm=TRUE)), keyby=list(X, Y, Z)]
result[, CDF := N.cum/nrow(dt)]
result
``````

# Update

No time to explain right now, but this is the solution I was looking for

``````X <- data.table(X=seq(-2, 2, by=.1)); X[, x := X]
Y <- data.table(Y=seq(-2, 2, by=.1)); Y[, y := Y]
Z <- data.table(Z=seq(-2, 2, by=.1)); Z[, z := Z]

dt <- X[dt, on="x", roll=-Inf, nomatch=0]
dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
dt <- Z[dt, on="z", roll=-Inf, nomatch=0]
bg <- dt[, .N, keyby=list(X, Y, Z)]

bounds <- CJ(X=X\$X, Y=Y\$Y, Z=Z\$Z)

kl <- bg[bounds, on=c("X", "Y", "Z")]
kl[is.na(N), N := 0]

# Counting
kl[, CountUntil.XgivenYZ := cumsum(N), by=list(Y, Z)]
kl[, CountUntil.XYgivenZ := cumsum(CountUntil.XgivenYZ), by=list(X, Z)]
kl[, CountUntil.XYZ := cumsum(CountUntil.XYgivenZ), by=list(X, Y)]

# Cleanup
setnames(kl, "CountUntil.XYZ", "N.cum")
kl[, CDF := N.cum/nrow(dt)]
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download