C8H10N4O2 - 1 year ago 93
R Question

# Unexpected .GRP sequence in data.table

Given a `data.table` such as:

``````
library(data.table)
n = 5000
set.seed(123)
pop = data.table(id=1:n, age=sample(18:80, n, replace=TRUE))
``````

and a function which converts a numeric vector into an ordered factor, such as:

``````
toAgeGroups <- function(x){
  groups = c('Under 40','40-64','65+')
  grp = findInterval(x, c(40,65)) + 1
  factor(groups[grp], levels=groups, ordered=TRUE)
}
``````
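As a quick illustration of how the cut points behave, here is a minimal sketch that calls the function on a few boundary ages. The sample ages are made up for illustration and are not part of the `pop` data.

```r
# Same function as above, repeated so the snippet runs on its own
toAgeGroups <- function(x){
  groups = c('Under 40','40-64','65+')
  # findInterval() returns 0 below 40, 1 for [40, 65), 2 from 65 upward
  grp = findInterval(x, c(40,65)) + 1
  factor(groups[grp], levels=groups, ordered=TRUE)
}

# Hypothetical ages chosen to sit on either side of each cut point
example_ages <- c(18, 39, 40, 64, 65, 80)
segments <- toAgeGroups(example_ages)

print(segments)
print(as.integer(segments))  # position of each value in the ordered levels
```

Ages of exactly 40 and 65 fall into the upper group, because `findInterval()` treats its intervals as closed on the left by default.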

I am seeing unexpected results when grouping on the output of this function as a key and indexing with `.GRP`.

``````
pop[, .(age_segment_id = .GRP, pop_count=.N), keyby=.(age_segment = toAgeGroups(age))]
``````

returns:

``````
   age_segment age_segment_id pop_count
1:    Under 40              1      1743
2:       40-64              3      2015
3:         65+              2      1242
``````

I would have expected the `age_segment_id` values to be `c(1,2,3)`, not `c(1,3,2)`, but `.GRP` seems to be set by order of occurrence in the underlying data (as with `by=`) rather than by sorted order (as with `keyby=`).

I was planning on using `.GRP` as an index for some additional labelling, but instead I need to do something like:

``````
pop[, .(pop_count=.N), keyby=.(age_segment = toAgeGroups(age))][, age_segment_id := .I][]
``````

to get what I want.

Is this expected behavior? If so, is there a better workaround?

(v. 1.9.6)

There was a very recent change to how `data.table` works internally that fixes your problem, so install the current development version:

``````
install.packages("data.table", type = "source",
                 repos = "http://Rdatatable.github.io/data.table")
``````

``````
library(data.table) #1.9.7+
pop[, .(age_segment_id = .GRP, pop_count=.N),
    keyby=.(age_segment = toAgeGroups(age))]
#    age_segment age_segment_id pop_count
# 1:    Under 40              1      1743
# 2:       40-64              2      2015
# 3:         65+              3      1242
``````

For more detail, see the discussion here. Basically, `by` internally sorts the rows to find the groups, then re-sorts the groups back into their order of first appearance in the original data.

The change recognized that this re-sort is unnecessary if `keyby` is specified, so now your approach works as you expected.

Before the change (in 1.9.6 and in the 1.9.7 development builds that predate it), `keyby` would just re-sort the answer at the end by running `setkey`, as documented in `?data.table`:

[`keyby` is the s]ame as `by`, but with an additional `setkey()` run on the `by` columns of the result.
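To see why that produces out-of-order ids, here is a minimal sketch that spells out the documented "`by`, then `setkey()`" sequence by hand. The five ages are invented for illustration so that the groups first appear in unsorted order, and the name `old_style` is just a placeholder.

```r
library(data.table)

toAgeGroups <- function(x){
  groups = c('Under 40','40-64','65+')
  grp = findInterval(x, c(40,65)) + 1
  factor(groups[grp], levels=groups, ordered=TRUE)
}

# Hypothetical data: groups first appear as 65+, Under 40, 40-64
toy <- data.table(id = 1:5, age = c(70, 30, 50, 30, 70))

# Step 1: plain `by` numbers the groups in order of first appearance
old_style <- toy[, .(age_segment_id = .GRP, pop_count = .N),
                 by = .(age_segment = toAgeGroups(age))]

# Step 2: sorting the result afterwards moves the rows but keeps the ids
setkey(old_style, age_segment)
print(old_style)
```

The rows end up in level order, but each `age_segment_id` still reflects where its group first occurred in the data, which is the mismatch described in the question.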

Thus, on less-than-brand-new versions of `data.table`, you'd have to sort the rows by `age` first, so that the groups first appear in sorted order and `.GRP` numbers them accordingly:

``````
pop[order(age), .(age_segment_id = .GRP, pop_count=.N),
    keyby=.(age_segment = toAgeGroups(age))]
``````
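If the id only needs to follow the factor's level order, another option that should not depend on how `.GRP` is assigned is to derive it from the ordered factor itself. This is a sketch on made-up ages, not something taken from the `data.table` documentation:

```r
library(data.table)

toAgeGroups <- function(x){
  groups = c('Under 40','40-64','65+')
  grp = findInterval(x, c(40,65)) + 1
  factor(groups[grp], levels=groups, ordered=TRUE)
}

# Hypothetical data for illustration
toy <- data.table(id = 1:8, age = c(70, 30, 50, 30, 70, 45, 18, 66))

res <- toy[, .(pop_count = .N), keyby = .(age_segment = toAgeGroups(age))]

# The integer code of a factor is the position of its level
res[, age_segment_id := as.integer(age_segment)]
print(res)
```

One caveat: this gives the level position, so if a level has no rows the ids will skip a number, whereas `.GRP` and `.I` always count consecutively.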