C8H10N4O2 - 1 month ago
R Question

Unexpected .GRP sequence in data.table

Given a data.table such as:

library(data.table)
n = 5000
set.seed(123)
pop = data.table(id=1:n, age=sample(18:80, n, replace=TRUE))


and a function which converts a numeric vector into an ordered factor, such as:

toAgeGroups <- function(x){
  groups = c('Under 40','40-64','65+')
  # findInterval() returns 0, 1 or 2; add 1 to index into groups
  grp = findInterval(x, c(40,65)) + 1
  factor(groups[grp], levels=groups, ordered=TRUE)
}


I am seeing unexpected results when grouping on the output of this function as a key and indexing with .GRP.

pop[, .(age_segment_id = .GRP, pop_count=.N), keyby=.(age_segment = toAgeGroups(age))]


returns:

   age_segment age_segment_id pop_count
1:    Under 40              1      1743
2:       40-64              3      2015
3:         65+              2      1242


I would have expected the age_segment_id values to be c(1,2,3), not c(1,3,2), but .GRP seems to be set in order of occurrence in the underlying data (as in by= order) rather than in sorted order (as in keyby=).

I was planning on using .GRP as an index for some additional labelling, but instead I need to do something like:

pop[, .(pop_count=.N), keyby=.(age_segment = toAgeGroups(age))][, age_segment_id := .I][]


to get what I want.

Is this expected behavior? If so, is there a better workaround?

(v. 1.9.6)

Answer

There was a very recent change to how data.table works internally that fixes your problem, so install the current development version:

install.packages("data.table", type = "source",
                 repos = "http://Rdatatable.github.io/data.table")

And re-run your code:

library(data.table) #1.9.7+
pop[, .(age_segment_id = .GRP, pop_count=.N),
    keyby=.(age_segment = toAgeGroups(age))]
#    age_segment age_segment_id pop_count
# 1:    Under 40              1      1743
# 2:       40-64              2      2015
# 3:         65+              3      1242

For more detail, see the discussion here. Basically, by internally sorts the rows so that each group's rows are together, then re-sorts the result back into the groups' original order of appearance.

The change recognized that this re-sort is unnecessary if keyby is specified, so now your approach works as you expected.

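Before the change (in 1.9.6 and in 1.9.7 development builds that predate it), keyby would just re-sort the answer at the end by running setkey, so .GRP had already been assigned in by order before the rows were sorted. This is as documented in ?data.table: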

[keyby is the s]ame as by, but with an additional setkey() run on the by columns of the result.

Thus, on older versions of data.table, you'd have to sort the rows first, so that the groups first appear in sorted order:

pop[order(age), .(age_segment_id = .GRP, pop_count=.N),
    keyby=.(age_segment = toAgeGroups(age))]
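
Another option that should not depend on the data.table version is to skip .GRP and take the ID from the factor itself, since age_segment is an ordered factor and as.integer() returns its level codes. This is only a sketch: the name res is made up for illustration, and the IDs are level positions, so they will have gaps if some level never occurs in the data.

```r
library(data.table)

n <- 5000
set.seed(123)
pop <- data.table(id = 1:n, age = sample(18:80, n, replace = TRUE))

toAgeGroups <- function(x) {
  groups <- c('Under 40', '40-64', '65+')
  grp <- findInterval(x, c(40, 65)) + 1
  factor(groups[grp], levels = groups, ordered = TRUE)
}

# Count per group first; keyby sorts the rows by the factor's level order.
res <- pop[, .(pop_count = .N), keyby = .(age_segment = toAgeGroups(age))]

# Derive the ID from the factor's integer level codes rather than from .GRP.
res[, age_segment_id := as.integer(age_segment)]
print(res)
```

Because the ID comes from the level codes, it stays tied to the label even if the rows are later re-ordered.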