andrasz - 1 year ago 86
R Question

# Subset by group with data.table compared to aggregate a data.table

This is a follow-up question to "Subset by group with data.table", using the same data.table:

```r
library(data.table)

# Assumption: baseball is the data set shipped with the plyr package
data(baseball, package = "plyr")
bdt <- as.data.table(baseball)

# Aggregating and losing information on other columns
dt1 <- bdt[ , .(max_g = max(g)), by = id]
# Aggregating and keeping information on other columns
dt2 <- bdt[bdt[, .I[g == max(g)], by = id]$V1]
```

Why do `dt1` and `dt2` differ in number of rows?
Isn't `dt2` supposed to give the same result, just without losing the respective information in the other columns?

As @Frank pointed out:

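`bdt[ , .(max_g = max(g)), by = id]` provides you with the maximum value, so it returns exactly one row per `id`, while

`bdt[bdt[ , .I[g == max(g)], by = id]$V1]` identifies all rows that have this maximum, so an `id` that reaches its maximum `g` in more than one row contributes several rows.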

See "What is the difference between arg max and max?" for a mathematical explanation and try this slim version in R:

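```r
library(data.table)

# Assumption: baseball is the data set shipped with the plyr package
data(baseball, package = "plyr")
bdt <- as.data.table(baseball)

# One player only, sorted by games played (descending)
dt <- bdt[id == "woodge01"][order(-g)]
# max: the maximum value itself
dt[ , .(max = max(g)), by = id]
# arg max: every row that attains the maximum
dt[ dt[ , .I[g == max(g)], by = id]$V1 ]
```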
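If you do not have the baseball data at hand, the same effect can be reproduced on a small made-up table. The sketch below is only an illustration: the table `toy` and its values are hypothetical, and the `which.max` variant is one possible way to keep a single row per group, not something taken from the original question.

```r
library(data.table)

# Hypothetical toy data: id "a" reaches its maximum g in two rows (a tie)
toy <- data.table(
  id   = c("a", "a", "a", "b", "b"),
  year = c(2001, 2002, 2003, 2001, 2002),
  g    = c(10, 30, 30, 5, 7)
)

# max: one value per group, hence always one row per id
agg <- toy[ , .(max_g = max(g)), by = id]

# arg max via ==: every row that attains the maximum, so ties add rows
all_max <- toy[toy[ , .I[g == max(g)], by = id]$V1]

# which.max keeps only the first maximal position, hence one row per id
first_max <- toy[toy[ , .I[which.max(g)], by = id]$V1]

nrow(agg)        # 2
nrow(all_max)    # 3
nrow(first_max)  # 2
```

Whether the tied rows should be kept or collapsed depends on what you want to do with the other columns afterwards; `which.max` simply picks the first tied row in the current row order.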