Ben - 10 months ago

R Question

I'm trying to measure the empirical cumulative distribution of some data in a multivariate setting. That is, given a dataset like

```
library(data.table)  # v 1.9.7
set.seed(2016)
dt <- data.table(x=rnorm(1000), y=rnorm(1000), z=rnorm(1000))

dt
             x        y       z
   1: -0.91474  2.07025 -1.7499
   2:  1.00125 -1.80941 -1.3856
   3: -0.05642  1.58499  0.8110
   4:  0.29665 -1.16660  0.3757
   5: -2.79147 -1.75526  1.2851
  ---
 996:  0.63423  0.13597 -2.3710
 997:  0.21415  1.03161 -1.5440
 998:  1.15357 -1.63713  0.4191
 999:  0.79205 -0.56119  0.6670
1000:  0.19502 -0.05297 -0.3288
```

I want to count the number of samples such that (x <= X, y <= Y, z <= Z) for some grid of (X, Y, Z) upper bounds like

```
bounds <- CJ(X=seq(-2, 2, by=.1), Y=seq(-2, 2, by=.1), Z=seq(-2, 2, by=.1))

bounds
        X  Y    Z
    1: -2 -2 -2.0
    2: -2 -2 -1.9
    3: -2 -2 -1.8
    4: -2 -2 -1.7
    5: -2 -2 -1.6
   ---
68917:  2  2  1.6
68918:  2  2  1.7
68919:  2  2  1.8
68920:  2  2  1.9
68921:  2  2  2.0
```
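For a single bound, the count in question is just the brute-force definition below; `count_one` is a hypothetical helper name used only for illustration:

```r
library(data.table)

set.seed(2016)
dt <- data.table(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000))

# Number of samples dominated by the upper bound (X, Y, Z)
count_one <- function(X, Y, Z) dt[x <= X & y <= Y & z <= Z, .N]

count_one(0, 0, 0)
```

Doing this once per row of `bounds` (68,921 times) is what any efficient method has to beat.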

Now, I've figured out that I can elegantly do this (using non-equi joins):

```
dt[, Count := 1]
result <- dt[bounds, on=c("x<=X", "y<=Y", "z<=Z"), allow.cartesian=TRUE][
  , list(N.cum = sum(!is.na(Count))), keyby=list(X=x, Y=y, Z=z)]
result[, CDF := N.cum/nrow(dt)]

result
        X  Y    Z N.cum   CDF
    1: -2 -2 -2.0     0 0.000
    2: -2 -2 -1.9     0 0.000
    3: -2 -2 -1.8     0 0.000
    4: -2 -2 -1.7     0 0.000
    5: -2 -2 -1.6     0 0.000
   ---
68917:  2  2  1.6   899 0.899
68918:  2  2  1.7   909 0.909
68919:  2  2  1.8   917 0.917
68920:  2  2  1.9   924 0.924
68921:  2  2  2.0   929 0.929
```

But this method is really inefficient and gets very slow as I start increasing the bin count. I suspect something like a multivariate rolling join in `data.table` could do this much more efficiently, but I can't figure out how.

Answer

Not ideal, but this solution is better than my previous attempt:

```
X <- data.table(X=seq(-2, 2, by=.1)); X[, x := X]
Y <- data.table(Y=seq(-2, 2, by=.1)); Y[, y := Y]
Z <- data.table(Z=seq(-2, 2, by=.1)); Z[, z := Z]

n <- nrow(dt)  # total sample size, before points above the grid are dropped

# Backward rolling joins snap each point up to the next grid value;
# nomatch=0 drops points beyond the top of the grid
dt <- X[dt, on="x", roll=-Inf, nomatch=0]
dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
dt <- Z[dt, on="z", roll=-Inf, nomatch=0]

# Per-cell counts, then accumulate them with a non-equi join over the grid
bg <- dt[, .N, keyby=list(X, Y, Z)]
bounds <- CJ(X=X$X, Y=Y$Y, Z=Z$Z)
result <- bg[bounds, on=c("X<=X", "Y<=Y", "Z<=Z"), allow.cartesian=TRUE][
  , list(N.cum = sum(N, na.rm=TRUE)), keyby=list(X, Y, Z)]
result[, CDF := N.cum/n]
result
```

Note that the original total `nrow(dt)` has to be captured before the rolling joins, since `nomatch=0` drops samples that fall above the grid and would otherwise deflate the denominator.
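The key step here is the backward rolling join. A minimal sketch of how `roll=-Inf` bins points, using a hypothetical coarse grid:

```r
library(data.table)

# A backward rolling join (roll = -Inf) snaps each data value up to the
# next grid point; the join column is duplicated so the grid value survives.
grid <- data.table(X = seq(-2, 2, by = 1))
grid[, x := X]

pts <- data.table(x = c(-1.7, 0.2, 0.99, 2.5))
binned <- grid[pts, on = "x", roll = -Inf, nomatch = 0]

# -1.7 bins to -1, 0.2 and 0.99 bin to 1; 2.5 has no grid point above it
# and is dropped by nomatch = 0
binned$X
```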

No time to explain right now, but this is the solution I was looking for:

```
X <- data.table(X=seq(-2, 2, by=.1)); X[, x := X]
Y <- data.table(Y=seq(-2, 2, by=.1)); Y[, y := Y]
Z <- data.table(Z=seq(-2, 2, by=.1)); Z[, z := Z]

n <- nrow(dt)  # total sample size, before points above the grid are dropped

# Bin each point to its grid cell via backward rolling joins
dt <- X[dt, on="x", roll=-Inf, nomatch=0]
dt <- Y[dt, on="y", roll=-Inf, nomatch=0]
dt <- Z[dt, on="z", roll=-Inf, nomatch=0]

# Per-cell counts on the full grid; empty cells get N = 0
bg <- dt[, .N, keyby=list(X, Y, Z)]
bounds <- CJ(X=X$X, Y=Y$Y, Z=Z$Z)
kl <- bg[bounds, on=c("X", "Y", "Z")]
kl[is.na(N), N := 0]

# Counting: three chained cumulative sums, one along each axis
kl[, CountUntil.XgivenYZ := cumsum(N), by=list(Y, Z)]
kl[, CountUntil.XYgivenZ := cumsum(CountUntil.XgivenYZ), by=list(X, Z)]
kl[, CountUntil.XYZ := cumsum(CountUntil.XYgivenZ), by=list(X, Y)]

# Cleanup
setnames(kl, "CountUntil.XYZ", "N.cum")
kl[, CDF := N.cum/n]
```
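To see why the chained `cumsum` calls produce the full multivariate counts, here is a self-contained 2-D analogue on a tiny hypothetical 3x3 grid; after the two passes, each cell holds the count of all cells with smaller-or-equal coordinates:

```r
library(data.table)

# Tiny 2-D version of the chained-cumsum trick (hypothetical per-cell counts)
cells <- CJ(X = 1:3, Y = 1:3)                # sorted by X, then Y
cells[, N := c(1, 0, 2, 0, 3, 0, 1, 1, 1)]   # counts per grid cell

cells[, cum := cumsum(N), by = Y]            # accumulate along X within each Y
cells[, cum := cumsum(cum), by = X]          # then along Y within each X

# cum now matches a direct count; e.g. for the bound (2, 2):
direct <- cells[, sum(N[X <= 2 & Y <= 2])]
```

The same idea extends to three dimensions with one extra pass, which is exactly what the answer above does.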