C8H10N4O2 - 9 months ago
R Question

Unexpected .GRP sequence in data.table

Given a data.table such as:

library(data.table)
n = 5000
pop = data.table(id=1:n, age=sample(18:80, n, replace=TRUE))

and a function which converts a numeric vector into an ordered factor, such as:

toAgeGroups <- function(x){
  groups = c('Under 40','40-64','65+')
  # findInterval() returns 0, 1 or 2; add 1 to index into groups
  grp = findInterval(x, c(40,65)) + 1
  factor(groups[grp], levels=groups, ordered=TRUE)
}

I am seeing unexpected results when grouping on the output of this function as a key and indexing with .GRP:

pop[, .(age_segment_id = .GRP, pop_count=.N), keyby=.(age_segment = toAgeGroups(age))]


   age_segment age_segment_id pop_count
1:    Under 40              1      1743
2:       40-64              3      2015
3:         65+              2      1242

I would have expected the age_segment_id values to be 1, 2, 3, not 1, 3, 2, but .GRP seems set on order of occurrence in underlying data (as in by order) rather than sorted order (as in keyby order).

I was planning on using .GRP as an index for some additional labelling, but instead I need to do something like:

pop[, .(pop_count=.N), keyby=.(age_segment = toAgeGroups(age))][, age_segment_id := .I][]

to get what I want.

Is this expected behavior? If so, is there a better workaround?

(v. 1.9.6)

Answer

There was a very recent change to how data.table works internally that fixes your problem, so install the current development version:

install.packages("data.table", type = "source",
                 repos = "http://Rdatatable.github.io/data.table")

And re-run your code:

library(data.table) #1.9.7+
pop[, .(age_segment_id = .GRP, pop_count=.N),
    keyby=.(age_segment = toAgeGroups(age))]
#    age_segment age_segment_id pop_count
# 1:    Under 40              1      1743
# 2:       40-64              2      2015
# 3:         65+              3      1242

For more background, see the discussion here. Basically, by internally sorts the rows to find the groups, then re-sorts the result back to the original order, that is, the order in which each group first appears in the data.

The change recognized that this re-sort is unnecessary if keyby is specified, so now your approach works as you expected.
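The difference between the two numberings can be reproduced without data.table. The following is a minimal base R sketch, not data.table's internal code: the toy age vector is hypothetical and chosen so that a 65+ value appears before any 40-64 value, and match(seg, unique(seg)) stands in for "group number by order of first appearance".

```r
# Hypothetical toy data: a 65+ age appears before any 40-64 age
age <- c(25, 70, 45, 30, 66, 50)
groups <- c("Under 40", "40-64", "65+")
seg <- factor(groups[findInterval(age, c(40, 65)) + 1],
              levels = groups, ordered = TRUE)

# Group number by order of first appearance in the data
# (an assumption standing in for how .GRP was assigned before the change)
id_first_seen <- match(seg, unique(seg))

# Group number by sorted factor level (what the question expected)
id_sorted <- as.integer(seg)

# One row per group, displayed in sorted (keyby-like) order
res <- unique(data.frame(seg, id_first_seen, id_sorted))
res <- res[order(res$seg), ]
print(res)
```

Printed in sorted order, id_first_seen reads 1, 3, 2 while id_sorted reads 1, 2, 3, which is the same pattern as the output in the question.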

Previously (through 1.9.6 and builds of 1.9.7 that predate this change), keyby would just re-sort the answer at the end by running setkey, as documented in ?data.table:

[keyby is the s]ame as by, but with an additional setkey() run on the by columns of the result.

Thus, on older versions of data.table, you'd have to sort the rows by age first, so that the groups are encountered in sorted order:

pop[order(age), .(age_segment_id = .GRP, pop_count=.N),
    keyby=.(age_segment = toAgeGroups(age))]
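Why ordering by age first helps can be checked in base R. This is a sketch under assumptions rather than a test of data.table itself: the toy age vector and the helper first_seen_id are hypothetical, and first_seen_id only mimics numbering groups in order of first appearance.

```r
# Same helper as in the question, repeated so the sketch is self-contained
toAgeGroups <- function(x) {
  groups <- c("Under 40", "40-64", "65+")
  factor(groups[findInterval(x, c(40, 65)) + 1],
         levels = groups, ordered = TRUE)
}

age <- c(25, 70, 45, 30, 66, 50)  # hypothetical toy data

# Hypothetical helper: group ids by order of first appearance
first_seen_id <- function(f) match(f, unique(f))

# Does first-appearance numbering agree with the sorted level codes?
seg_unsorted <- toAgeGroups(age)
seg_sorted <- toAgeGroups(sort(age))
unsorted_ok <- all(first_seen_id(seg_unsorted) == as.integer(seg_unsorted))
sorted_ok <- all(first_seen_id(seg_sorted) == as.integer(seg_sorted))

print(c(unsorted = unsorted_ok, sorted = sorted_ok))
```

On the unsorted vector the two numberings disagree; once age is sorted they agree, because toAgeGroups is monotone in age and every group is therefore first seen in level order.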