drfosterrainsnafu drfosterrainsnafu - 11 months ago 39
R Question

How do I build a Multiple Criteria Index in R, incorporating > and < operators?

I'm trying to lookup and merge two data frames I have on multiple criteria and having a lot of trouble making sense of it.

I've got two sets of data, one I'm looking to index against that contains the validity dates of some products and the other that gives me product use, as follows:

Indexset <- data.table(validfrom=as.Date(c("2015-08-01", "2015-08-02", "2015-08-03", "2015-08-04")),
validto=as.Date(c("2015-08-07", "2015-08-08","2015-08-09", "2015-08-10")),
username=c("Smith", "Cole", "Amos", "Richardson"),

Useset <- data.table(usedate=as.Date(c("2015-08-04", "2015-08-06", "2015-08-06", "2015-08-09")),
username=c("Smith", "Richardson", "Cole", "Amos"))

What I'm looking to do is add a column to 'Useset' that will contain the 'productcode' from 'Indexset' by checking the 'usedate' is between the 'validto' & 'validfrom' dates and then matching the name.

I've tried various ways around the 'merge' function but couldn't work out how to get the greater than and less than operators into the syntax.

Have also tried setting and rolling joins but struggling to make them work.

Currently migrating from excel where this would just be a multiple criteria index match in an array but not sure how to put it together in R.

To be clear, I'm not hitting errors, just completely lost in putting the syntax together.

Once I have the 'productcodes' I think I can handle the merge but absolutely stumped here!

Would appreciate any help anyone can offer!

Answer Source

You may try the non-equi join feature from the current development version of data.table, v1.9.7. Please read the installation instructions here.

Useset[Indexset, on = .(username, usedate >= validfrom, usedate <= validto), # which rows?
         productcode := productcode][]                                       # what to do?

#       usedate   username productcode
# 1: 2015-08-01      Smith           1
# 2: 2015-08-04 Richardson           4
# 3: 2015-08-02       Cole           2
# 4: 2015-08-03       Amos           3

Usenet gets updated in-place.

If the performance is not a big issue, join on username and then filter should also work, this does not require the dev version of data.table:

Useset[Indexset, on = "username"][usedate >= validfrom & usedate <= validto, .(usedate, username, productcode)]
#      usedate   username productcode
# 1: 2015-08-01      Smith           1
# 2: 2015-08-02       Cole           2
# 3: 2015-08-03       Amos           3
# 4: 2015-08-04 Richardson           4