Christoph_J - 1 day ago 4
R Question

# Is my way of duplicating rows in data.table efficient?

I have monthly data in one

`data.table`
and annual data in another
`data.table`
and now I want to match the annual data to the respective observation in the monthly data.

My approach is as follows: Duplicating the annual data for every month and then join the monthly and annual data. And now I have a question regarding the duplication of rows. I know how to do it, but I'm not sure if it is the best way to do it, so some opinions would be great.

Here is an exemplatory
`data.table DT`
for my annual data and how I currently duplicate:

``````library(data.table)
DT <- data.table(ID = paste(rep(c("a", "b"), each=3), c(1:3, 1:3), sep="_"),
values = 10:15,
startMonth = seq(from=1, by=2, length=6),
endMonth = seq(from=3, by=3, length=6))
DT
ID values startMonth endMonth
[1,] a_1     10          1        3
[2,] a_2     11          3        6
[3,] a_3     12          5        9
[4,] b_1     13          7       12
[5,] b_2     14          9       15
[6,] b_3     15         11       18
#1. Alternative
DT1 <- DT[, list(MONTH=startMonth:endMonth), by="ID"]
setkey(DT,  ID)
setkey(DT1, ID)
DT1[DT]
ID MONTH values startMonth endMonth
a_1     1     10          1        3
a_1     2     10          1        3
a_1     3     10          1        3
a_2     3     11          3        6
[...]
``````

The last join is exactly what I want. However,
`DT[, list(MONTH=startMonth:endMonth), by="ID"]`
`DT`
, so I was wondering if I could get rid of the last three rows in my code, i.e. the
`setkey`
and
`join`
operations. It turns out, you can, just do the following:

``````#2. Alternative: More intuitiv and just one line of code
DT[, list(MONTH=startMonth:endMonth, values, startMonth, endMonth), by="ID"]
ID MONTH values startMonth endMonth
a_1    1     10          1        3
a_1    2     10          1        3
a_1    3     10          1        3
a_2    3     11          3        6
...
``````

This, however, only works because I hardcoded the column names into the
`list`
expression. In my real data, I do not know the names of all columns in advance, so I was wondering if I could just tell
`data.table`
to return the column
`MONTH`
that I compute as shown above and all the other columns of
`DT`
.
`.SD`
seemed to be able to do the trick, but:

``````DT[, list(MONTH=startMonth:endMonth, .SD), by="ID"]
Error in `[.data.table`(DT, , list(YEAR = startMonth:endMonth, .SD), by = "ID") :
maxn (4) is not exact multiple of this j column's length (3)
``````

So to summarize, I know how it's been done, but I was just wondering if this is the best way to do it because I'm still struggling a little bit with the syntax of
`data.table`
and often read in posts and on the wiki that there are good and bads ways of doing things. Also, I don't quite get why I get an error when using
`.SD`
. I thought it is just any easy way to tell
`data.table`
that you want all columns. What do I miss?

Great question. What you tried was very reasonable. Assuming you're using v1.7.1 it's now easier to make `list` columns. In this case it's trying to make one `list` column out of `.SD` (3 items) alongside the MONTH column of the 2nd group (4 items). I'll raise it as a bug [EDIT: now fixed in v1.7.5], thanks.

In the meantime, try :

``````DT[, cbind(MONTH=startMonth:endMonth, .SD), by="ID"]
ID MONTH values startMonth endMonth
a_1     1     10          1        3
a_1     2     10          1        3
a_1     3     10          1        3
a_2     3     11          3        6
...
``````

Also, just to check you've seen `roll=TRUE`? Typically you'd have just one startMonth column (irregular with gaps) and then just `roll` join to it. Your example data has overlapping month ranges though, so that complicates it.