andrasz andrasz - 1 year ago 86
R Question

Subset by group with data.table compared to aggregate a data.table

This is a follow-up question to "Subset by group with data.table", using the same data.table:

library(data.table)
data(baseball, package = "plyr")  # the baseball data set ships with plyr

bdt <- as.data.table(baseball)

# Aggregating and losing information on other columns
dt1 <- bdt[ , .(max_g = max(g)), by = id]
# Aggregating and keeping information on other columns
dt2 <- bdt[bdt[, .I[g == max(g)], by = id]$V1]


Why do dt1 and dt2 differ in number of rows?
Isn't dt2 supposed to give the same result, just without losing the respective information in the other columns?

Answer Source

As @Frank pointed out:

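bdt[ , .(max_g = max(g)), by = id] returns the maximum value itself, so exactly one row per id, while

bdt[bdt[ , .I[g == max(g)], by = id]$V1] returns every row whose g equals that maximum. When several rows of the same id tie for the maximum, all of them are kept, so dt2 can have more rows than dt1.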
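
The difference is easiest to see on a small table that contains a tie. What follows is only a minimal sketch: the table toy and its values are made up for illustration and are not part of the baseball data.

```r
library(data.table)

# Hypothetical toy data: id "a" reaches its maximum g (20) in two rows
toy <- data.table(id   = c("a", "a", "a", "b", "b"),
                  year = c(2001, 2002, 2003, 2001, 2002),
                  g    = c(10, 20, 20, 5, 7))

# max: one row per id, holding only the maximum value
agg <- toy[ , .(max_g = max(g)), by = id]

# arg max: .I gives row numbers in toy, so this keeps every row
# whose g equals the maximum of its group, ties included
rows <- toy[toy[ , .I[g == max(g)], by = id]$V1]

nrow(agg)   # 2
nrow(rows)  # 3
```

Here id "a" contributes two rows to rows but only one to agg, which is the same effect that makes dt2 longer than dt1.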

See "What is the difference between arg max and max?" for a mathematical explanation, and try this slimmed-down version in R:

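library(data.table)
data(baseball, package = "plyr")  # the baseball data set ships with plyr
bdt <- as.data.table(baseball)

# One player only, sorted by g in descending order
dt <- bdt[id == "woodge01"][order(-g)]
# max: a single row holding the maximum value
dt[ , .(max = max(g)), by = id]
# arg max: every row in which that maximum occurs
dt[ dt[ , .I[g == max(g)], by = id]$V1 ]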
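
If the goal is exactly one row per id while keeping the other columns, one possible approach is to replace the comparison with which.max(), which returns the position of the first maximum only. This is a sketch on a made-up table (toy is hypothetical, not the baseball data), and whether the first tied row is the right one to keep depends on the use case.

```r
library(data.table)

# Hypothetical toy data: id "a" reaches its maximum g (20) in two rows
toy <- data.table(id   = c("a", "a", "a", "b", "b"),
                  year = c(2001, 2002, 2003, 2001, 2002),
                  g    = c(10, 20, 20, 5, 7))

# which.max(g) picks the first position of the group maximum,
# so .I[which.max(g)] yields a single row number per id
first_max <- toy[toy[ , .I[which.max(g)], by = id]$V1]

nrow(first_max)  # 2, the same as the number of distinct ids
```

With ties, this keeps the row that appears first within each group, so sorting the table beforehand controls which tied row survives.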