Anirban Mukherjee Anirban Mukherjee - 2 months ago 19
R Question

Non-equi join, then summarize by group

Here is a MWE.

dta <- data.table(id=rep(1:2, each=5), seq=rep(1:5, 2), val=1:10)
dtb <- data.table(id=c(1, 1, 2, 2), fil=c(2, 3, 3, 4))
dtc <- data.table(id=c(1, 1, 2, 2), mval=rep(0, 4))
for (ind in 1:4) dtc$mval[ind] <- mean( dta$val [dta$id == dtb$id[ind] & dta$seq < dtb$fil[ind]] )

dtc
# id mval
# 1: 1 1.0
# 2: 1 1.5
# 3: 2 6.5
# 4: 2 7.0


dtc should have the same number of rows as dtb. For every (row)
ind
in dtc,


  1. dtc$id[ind]
    =
    dtb$id[ind]
    .

  2. dtc$mval[ind]
    =
    mean(dta$val[x])
    , where x is
    dta$id == dtb$id[ind] & dta$seq < dtb$fil[ind]
    .



My data.tables are extremely large. Hence, I am looking for a way to achieve the above with minimal memory footprint. I was thinking a non-equi join and then a summarize, but I can't seem to get that to work. Hence, the title of the question.

Would greatly appreciate any help, thanks!

Answer

May be this helps

dtc[, mval := dta[dtb, mean(val) ,on =.(id, seq < fil), by = .EACHI]$V1]
dtc
#   id mval
#1:  1  1.0
#2:  1  1.5
#3:  2  6.5
#4:  2  7.0