ironv ironv - 18 days ago 5
R Question

speeding up a non-equal join

Inputs

> specialty.dt
specialty p1 p2
1: ZKWM 0.0000000 0.7377049
2: MZAY 0.7377049 1.0000000

> provider.dt
provSysId prob
1: 23 0.94225972
2: 16 0.39277028
3: 8 0.07162044
4: 25 0.42598790
5: 7 0.90370561
6: 12 0.71343887


Output

> prov_spec.dt
provSysId prob specialty p1 p2
1: 23 0.94225972 MZAY 0.7377049 1.0000000
2: 16 0.39277028 ZKWM 0.0000000 0.7377049
3: 8 0.07162044 ZKWM 0.0000000 0.7377049
4: 25 0.42598790 ZKWM 0.0000000 0.7377049
5: 7 0.90370561 MZAY 0.7377049 1.0000000
6: 12 0.71343887 ZKWM 0.0000000 0.7377049


Code to create above tables is shown below. For
num.provider=5
and
num.specialty=10000
, it takes almost 30s to create the output. I was wondering if there was a faster way to get the same result (without first doing a cartesian product because that would require a lot of memory).

require(data.table)

num.specialty <- 50
num.provider <- 10000

specialty.dt <- data.table(specialty=replicate(num.specialty, paste(sample(LETTERS, 4, replace=TRUE), collapse="")))[,
cnt:=sample(1:50, .N, replace=T)][, prob:=cnt/sum(cnt)][, p2:=cumsum(prob)][, p1:=shift(p2,,0)][,
c("specialty","p1","p2"), with=FALSE]

provider.dt <- data.table(provSysId=sample(seq(num.provider+1,num.provider*5), num.provider, replace=FALSE))[, prob:=runif(.N)]

system.time({
prov_spec.dt <- rbindlist(lapply(1:num.provider, function(n) {r <- provider.dt[n]; cbind(r,specialty.dt[p1 <= r[,prob] & r[,prob] < p2]) }))
})

Answer

Your rbindlist(lapply(...)) can be replaced with a non-equi join using version 1.9.7 of data.table

specialty.dt[ provider.dt, on = .(p1 <= prob, p2 > prob)]

This joins specialty.dt onto provider.dt directly, using the condition that p1 <= prob, and prob < p2.


References

Here's a list of similar questions

And here's a talk by Arun