thelatemail - 1 month ago 4
R Question

# data.table grouping separately on numeric and text variables

I'm trying to simplify this

`data.table`
two-stage process which acts on both numeric and character variables. E.g. - take the first element of
`textvar`
and
`sum`
each of the numeric variables. Consider this small example:

``````library(data.table)
dt <- data.table(grpvar=letters[c(1,1,2)], textvar=c("one","two","one"),
numvar=1:3, othernum=2:4)
dt
#   grpvar textvar numvar othernum
#1:      a     one      1        2
#2:      a     two      2        3
#3:      b     one      3        4
``````

Now my first thought was to nest
`.SD`
to drop the one variable out of the
`lapply`
call, but I thought that was a bit complicated:

``````dt[, c(textvar=textvar[1], .SD[, lapply(.SD, sum), .SDcols=-c("textvar")]), by=grpvar]
#   grpvar textvar numvar othernum
#1:      a     one      3        5
#2:      b     one      3        4
``````

Then I thought maybe I could do each grouping separately and join them, but that seems even worse:

``````dt[, .(textvar=textvar[1]), by=grpvar][
dt[, lapply(.SD, sum), by=grpvar, .SDcols=-c("textvar")], on="grpvar"
]
#   grpvar textvar numvar othernum
#1:      a     one      3        5
#2:      b     one      3        4
``````

Is there a simpler construction that would get around the nesting of
`.SD`
or the joining? I feel like I'm overlooking something elementary.

The `j`-argument in data.table is (deliberately) quite flexible. All we need to remember is that:

As long as `j` returns a list, each element of the list will become a column in the resulting data.table.

Using the fact that `c(list, list)` is a `list`, we can construct the expression as follows:

``````dt[, c(textvar = textvar[1L], lapply(.SD, sum)), # select/compute all cols necessary
.SDcols = numvar:othernum,                 # provide .SD's columns
by = grpvar]                               # group by 'grpvar'
#    grpvar textvar numvar othernum
# 1:      a     one      3        5
# 2:      b     one      3        4
``````

Here, I've not wrapped the first expression with `list()` since `textvar[1L]` returns a length=1 vector.. i.e., `identical(c(1, list(2, 3)), c(list(1), list(2,3)))` is `TRUE`.

Note that this is only possible from `v1.9.7`. The bug was just recently fixed in the current development version.