Daedalus - 1 month ago 7

R Question

here is a mini example to illustrate my task. I have a large data set, let's say data table dt, with columns date, exdate (the due date) and a price X (e.g. delivery price at exdate).

`date exdate X`

1999-01-01 1999-02-01 100

1999-01-01 1999-01-17 50

1999-01-01 1999-05-01 75

1999-01-02 1999-01-17 60

1999-01-02 1999-04-02 50

... ... ...

and so on.

I now want to perform certain analysis:

- For each date and each exdate, I want to apply functions to all prices corresponding to exdates
**greater**than the current exdate. (e.g. check if price is higher:`sapply(dt[date==date[1] & exdate > exdate[1],X],function(x) x>dt[date==date[1] & exdate = exdate[1],X])`

- Same as 1. but now taking the same exdates and looking for all products corresponding to prices larger than the current one and again applying certain formulas.

I have tried this slicing with a combination of the data.table method is shown in 1. and a nested for loops iterating over each column. Needless to say, this takes quite long.

My next idea was to use a multidimensional array to perform the analysis simultaneously, but I could not get it to work.

My last try was exploring data.tables .SD and .GRP options, but also this approach was not fruitful, although I thought that I am just missing something here and this could be an option.

I have read quite a bit in this forum about slicing and optimisation, but given that the formulas I am applying are not trivial (like for example mean/sum/average could be considered), I could not apply the given solutions.

So if anyone has dealt with such a problem/task before or knows the matter I would highly appreciate any help I can get!

Thanks!

Answer Source

Concerning the first question, this can be solved using `data.table`

's *non-equi self-joins*:

```
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table and order data for later comparisons
setorder(setDT(DT), date, exdate, X)
# non-equi self-join
DT[DT, on = c("date", "exdate>=exdate"),
# aggregate
.(min.X = min(X), min.exdate = x.exdate[which.min(X)], n.lt.1st = sum(X < first(X)),
max.X = max(X), max.exdate = x.exdate[which.max(X)], n.gt.1st = sum(X > first(X))),
# for each join group
by = .EACHI]
```

`date exdate min.X min.exdate n.lt.1st max.X max.exdate n.gt.1st 1: 1999-01-01 1999-01-17 50 1999-01-17 0 100 1999-02-01 2 2: 1999-01-01 1999-02-01 75 1999-05-01 1 100 1999-02-01 0 3: 1999-01-01 1999-05-01 75 1999-05-01 0 75 1999-05-01 0 4: 1999-01-02 1999-01-17 50 1999-04-02 1 60 1999-01-17 0 5: 1999-01-02 1999-04-02 50 1999-04-02 0 50 1999-04-02 0`

The aggregates are just examples which show the minimum value, the `exdate`

of the minimum value, the number values smaller than the first value, and correspondingly the same for maximum.

Note that the condition is `exdate>=exdate`

, not `exdate>exdate`

. This is to include the "reference row" in the result set in order to compare with it, i.e, `X < first(X)`

.

For the second question, it is unclear what *product* refers to.

```
library(data.table)
DT <- fread(
"date exdate X
1999-01-01 1999-02-01 100
1999-01-01 1999-01-17 50
1999-01-01 1999-05-01 75
1999-01-02 1999-01-17 60
1999-01-02 1999-04-02 50"
)
cols <- c("date", "exdate")
DT[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
str(DT)
```

`Classes ‘data.table’ and 'data.frame': 5 obs. of 4 variables: $ date : IDate, format: "1999-01-01" "1999-01-01" "1999-01-01" "1999-01-02" ... $ exdate : IDate, format: "1999-01-17" "1999-02-01" "1999-05-01" "1999-01-17" ... $ X : int 50 100 75 60 50 $ refdate: IDate, format: "1999-01-17" "1999-02-01" "1999-05-01" "1999-01-17" ... - attr(*, ".internal.selfref")=<externalptr>`

Note that `datatable`

s own date class `IDate`

with integer storage is used.