Daedalus Daedalus - 1 month ago 7
R Question

Optimising data.table code for slicing and computation

here is a mini example to illustrate my task. I have a large data set, let's say data table dt, with columns date, exdate (the due date) and a price X (e.g. delivery price at exdate).

date exdate X
1999-01-01 1999-02-01 100
1999-01-01 1999-01-17 50
1999-01-01 1999-05-01 75
1999-01-02 1999-01-17 60
1999-01-02 1999-04-02 50
... ... ...


and so on.
I now want to perform certain analysis:


  1. For each date and each exdate, I want to apply functions to all prices corresponding to exdates greater than the current exdate. (e.g. check if price is higher:
    sapply(dt[date==date[1] & exdate > exdate[1],X],function(x) x>dt[date==date[1] & exdate = exdate[1],X])

  2. Same as 1. but now taking the same exdates and looking for all products corresponding to prices larger than the current one and again applying certain formulas.



I have tried this slicing with a combination of the data.table method is shown in 1. and a nested for loops iterating over each column. Needless to say, this takes quite long.

My next idea was to use a multidimensional array to perform the analysis simultaneously, but I could not get it to work.

My last try was exploring data.tables .SD and .GRP options, but also this approach was not fruitful, although I thought that I am just missing something here and this could be an option.

I have read quite a bit in this forum about slicing and optimisation, but given that the formulas I am applying are not trivial (like for example mean/sum/average could be considered), I could not apply the given solutions.

So if anyone has dealt with such a problem/task before or knows the matter I would highly appreciate any help I can get!

Thanks!

Answer Source

Concerning the first question, this can be solved using data.table's non-equi self-joins:

library(data.table)   # CRAN version 1.10.4 used
# coerce to data.table and order data for later comparisons
setorder(setDT(DT), date, exdate, X)
# non-equi self-join
DT[DT, on = c("date", "exdate>=exdate"), 
   # aggregate
   .(min.X = min(X), min.exdate = x.exdate[which.min(X)], n.lt.1st = sum(X < first(X)),
     max.X = max(X), max.exdate = x.exdate[which.max(X)], n.gt.1st = sum(X > first(X))), 
   # for each join group
   by = .EACHI]
         date     exdate min.X min.exdate n.lt.1st max.X max.exdate n.gt.1st
1: 1999-01-01 1999-01-17    50 1999-01-17        0   100 1999-02-01        2
2: 1999-01-01 1999-02-01    75 1999-05-01        1   100 1999-02-01        0
3: 1999-01-01 1999-05-01    75 1999-05-01        0    75 1999-05-01        0
4: 1999-01-02 1999-01-17    50 1999-04-02        1    60 1999-01-17        0
5: 1999-01-02 1999-04-02    50 1999-04-02        0    50 1999-04-02        0

The aggregates are just examples which show the minimum value, the exdate of the minimum value, the number values smaller than the first value, and correspondingly the same for maximum.

Note that the condition is exdate>=exdate, not exdate>exdate. This is to include the "reference row" in the result set in order to compare with it, i.e, X < first(X).

For the second question, it is unclear what product refers to.

Data

library(data.table)
DT <- fread(
  "date          exdate       X
1999-01-01   1999-02-01   100
1999-01-01   1999-01-17    50
1999-01-01   1999-05-01    75
1999-01-02   1999-01-17    60
1999-01-02   1999-04-02    50"
)
cols <- c("date", "exdate")
DT[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
str(DT)
Classes ‘data.table’ and 'data.frame':    5 obs. of  4 variables:
 $ date   : IDate, format: "1999-01-01" "1999-01-01" "1999-01-01" "1999-01-02" ...
 $ exdate : IDate, format: "1999-01-17" "1999-02-01" "1999-05-01" "1999-01-17" ...
 $ X      : int  50 100 75 60 50
 $ refdate: IDate, format: "1999-01-17" "1999-02-01" "1999-05-01" "1999-01-17" ...
 - attr(*, ".internal.selfref")=<externalptr>

Note that datatables own date class IDate with integer storage is used.