KTY KTY - 3 months ago 27
R Question

Use of lapply .SD in data.table R

I am not very clear about use of

.SD
and
by
.

For instance, does the below snippet mean: 'change all the columns in
DT
to factor except
A
and
B
?' It also says in
data.table
manual: "
.SD
refers to the Subset of the
data.table
for each group (excluding the grouping columns)" - so columns
A
and
B
are excluded?

DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)]


However, I also read that
by
means like 'group by' in SQL when you do aggregation. For instance, if I would like to sum (like
colsum
in SQL) over all the columns except
A
and
B
do I still use something similar? Or in this case, does the below code mean to take the sum and group by values in columns
A
and
B
? (take sum and group by
A,B
as in SQL)

DT[,lapply(.SD,sum),by=.(A,B)]


Then how do I do a simple
colsum
over all the columns except
A
and
B
?

Answer

Just to illustrate the comments above with an example, let's take

set.seed(10238)
DT <- data.table(A = rep(1:3, each = 5), B = rep(1:5, 3),
                 C = sample(15), D = sample(15))
DT
#     A B  C  D
#  1: 1 1 14 11
#  2: 1 2  3  8
#  3: 1 3 15  1
#  4: 1 4  1 14
#  5: 1 5  5  9
#  6: 2 1  7 13
#  7: 2 2  2 12
#  8: 2 3  8  6
#  9: 2 4  9 15
# 10: 2 5  4  3
# 11: 3 1  6  5
# 12: 3 2 12 10
# 13: 3 3 10  4
# 14: 3 4 13  7
# 15: 3 5 11  2

Compare the following:

#Sum all columns
DT[ , lapply(.SD, sum)]
#     A  B   C   D
# 1: 30 45 120 120
#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
#    A  B  C  D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28
#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
#     B   C   D
# 1: 45 120 120
#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
#    B  C  D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14

A few notes:

  • You said "does the below snippet... change all the columns in DT..."

The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.

  • You mentioned wanting to change the column types

Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One (incorrect) way to do this, which is done with data.frames in base, is to overwrite the old data.table with the new data.table you've returned, i.e., DT <- DT[ , lapply(.SD, as.factor)].

This is wasteful because it involves creating copies of DT which can be an efficiency killer when DT is large. The correct data.table approach to this problem is to update the columns by reference using`:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which creates no copies of your data. See data.table's reference semantics vignette for more on this.

  • You mentioned comparing efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table; to see this in action, let's beef up your DT a bit and run a benchmark:

Results:

library(data.table)
set.seed(12039)
nn <- 1e7; kk <- seq(100L)
DT <- as.data.table(replicate(26, sample(kk, nn, T))
                    )[ , LETTERS[1:2] := .(sample(100, nn, T),
                                           sample(100, nn, T))]

library(microbenchmark)
microbenchmark(times = 100L,
               colsums = colSums(DT[ , !c("A", "B"), with = FALSE]),
               lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")])
# Unit: milliseconds
#     expr      min       lq     mean   median       uq      max neval
#  colsums 848.9310 886.6289 906.8105 896.7696 925.4353 997.0001   100
#  lapplys 144.5028 145.7165 154.4077 147.5586 153.2286 253.6726   100
Comments