Daedalus - 6 months ago 22
R Question

Optimising data.table code for slicing and computation

here is a mini example to illustrate my task. I have a large data set, let's say data table dt, with columns date, exdate (the due date) and a price X (e.g. delivery price at exdate).

``````date          exdate       X
1999-01-01   1999-02-01   100
1999-01-01   1999-01-17    50
1999-01-01   1999-05-01    75
1999-01-02   1999-01-17    60
1999-01-02   1999-04-02    50
...            ...         ...
``````

and so on.
I now want to perform certain analysis:

1. For each date and each exdate, I want to apply functions to all prices corresponding to exdates greater than the current exdate. (e.g. check if price is higher:
`sapply(dt[date==date[1] & exdate > exdate[1],X],function(x) x>dt[date==date[1] & exdate = exdate[1],X])`

2. Same as 1. but now taking the same exdates and looking for all products corresponding to prices larger than the current one and again applying certain formulas.

I have tried this slicing with a combination of the data.table method is shown in 1. and a nested for loops iterating over each column. Needless to say, this takes quite long.

My next idea was to use a multidimensional array to perform the analysis simultaneously, but I could not get it to work.

My last try was exploring data.tables .SD and .GRP options, but also this approach was not fruitful, although I thought that I am just missing something here and this could be an option.

I have read quite a bit in this forum about slicing and optimisation, but given that the formulas I am applying are not trivial (like for example mean/sum/average could be considered), I could not apply the given solutions.

So if anyone has dealt with such a problem/task before or knows the matter I would highly appreciate any help I can get!

Thanks!

Concerning the first question, this can be solved using `data.table`'s non-equi self-joins:

``````library(data.table)   # CRAN version 1.10.4 used
# coerce to data.table and order data for later comparisons
setorder(setDT(DT), date, exdate, X)
# non-equi self-join
DT[DT, on = c("date", "exdate>=exdate"),
# aggregate
.(min.X = min(X), min.exdate = x.exdate[which.min(X)], n.lt.1st = sum(X < first(X)),
max.X = max(X), max.exdate = x.exdate[which.max(X)], n.gt.1st = sum(X > first(X))),
# for each join group
by = .EACHI]
``````
``````         date     exdate min.X min.exdate n.lt.1st max.X max.exdate n.gt.1st
1: 1999-01-01 1999-01-17    50 1999-01-17        0   100 1999-02-01        2
2: 1999-01-01 1999-02-01    75 1999-05-01        1   100 1999-02-01        0
3: 1999-01-01 1999-05-01    75 1999-05-01        0    75 1999-05-01        0
4: 1999-01-02 1999-01-17    50 1999-04-02        1    60 1999-01-17        0
5: 1999-01-02 1999-04-02    50 1999-04-02        0    50 1999-04-02        0
``````

The aggregates are just examples which show the minimum value, the `exdate` of the minimum value, the number values smaller than the first value, and correspondingly the same for maximum.

Note that the condition is `exdate>=exdate`, not `exdate>exdate`. This is to include the "reference row" in the result set in order to compare with it, i.e, `X < first(X)`.

For the second question, it is unclear what product refers to.

Data

``````library(data.table)
"date          exdate       X
1999-01-01   1999-02-01   100
1999-01-01   1999-01-17    50
1999-01-01   1999-05-01    75
1999-01-02   1999-01-17    60
1999-01-02   1999-04-02    50"
)
cols <- c("date", "exdate")
DT[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
str(DT)
``````
``````Classes ‘data.table’ and 'data.frame':    5 obs. of  4 variables:
\$ date   : IDate, format: "1999-01-01" "1999-01-01" "1999-01-01" "1999-01-02" ...
\$ exdate : IDate, format: "1999-01-17" "1999-02-01" "1999-05-01" "1999-01-17" ...
\$ X      : int  50 100 75 60 50
\$ refdate: IDate, format: "1999-01-17" "1999-02-01" "1999-05-01" "1999-01-17" ...
- attr(*, ".internal.selfref")=<externalptr>
``````

Note that `datatable`s own date class `IDate` with integer storage is used.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download