andrasz andrasz - 1 year ago 102
R Question

Subset by group with data.table compared to aggregate a data.table

This is a follow-up question to "Subset by group with data.table", using the same data.table:


bdt <-

# Aggregating and losing information on other columns
dt1 <- bdt[ , .(max_g = max(g)), by = id]
# Aggregating and keeping information on other columns
dt2 <- bdt[bdt[, .I[g == max(g)], by = id]$V1]

Why do dt1 and dt2 differ in number of rows?
Isn't dt2 supposed to give the same result, just without losing the respective information in the other columns?

Answer Source

As @Frank pointed out:

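bdt[ , .(max_g = max(g)), by = id] returns the maximum value of g for each id, so exactly one row per id, while

bdt[bdt[ , .I[g == max(g)], by = id]$V1] returns all rows that attain this maximum. If an id reaches its maximum g in more than one row, every one of those rows is kept, which is why dt2 can have more rows than dt1.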
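To make the difference concrete, here is a minimal sketch on made-up data. The table `toy` and its columns `year` and `g` values are assumptions for illustration only, not the original bdt. It also shows `which.max()`, which returns only the first position of the maximum, as one possible way to get a single row per id while keeping the other columns.

```r
library(data.table)

# Hypothetical toy data (not the original bdt):
# id "a" reaches its maximum g = 12 in two different rows
toy <- data.table(
  id   = c("a", "a", "a", "b", "b"),
  year = c(2001, 2002, 2003, 2001, 2002),
  g    = c(10, 12, 12, 7, 5)
)

# max: one row per id, other columns are dropped
agg <- toy[, .(max_g = max(g)), by = id]

# arg max: every row whose g equals the group maximum, ties included
all_max <- toy[toy[, .I[g == max(g)], by = id]$V1]

# first arg max only: which.max() returns a single position per group,
# so the result has one row per id and keeps the other columns
first_max <- toy[toy[, .I[which.max(g)], by = id]$V1]

nrow(agg)        # 2
nrow(all_max)    # 3, because id "a" has a tie
nrow(first_max)  # 2
```

Whether keeping only the first tied row is acceptable depends on what the other columns mean in your data; if ties matter, the `g == max(g)` form is the safer choice.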

See "What is the difference between arg max and max?" for a mathematical explanation, and try this slim version in R:

bdt <-

dt <- bdt[id == "woodge01"][order(-g)]
dt[ , .(max = max(g)), by = id]
dt[ dt[ , .I[g == max(g)], by = id]$V1 ]
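If you want to find out which ids are responsible for the extra rows, one option is to count how many rows per id attain the group maximum. This is a sketch on hypothetical data: `toy` and the column name `n_at_max` are made up here, and you would substitute your own table.

```r
library(data.table)

# Hypothetical toy data standing in for bdt
toy <- data.table(
  id = c("a", "a", "a", "b", "b"),
  g  = c(10, 12, 12, 7, 5)
)

# Per id: the maximum of g and the number of rows that attain it
ties <- toy[, .(max_g = max(g), n_at_max = sum(g == max(g))), by = id]

# ids with more than one row at the maximum explain the extra rows
tied_ids <- ties[n_at_max > 1]

# The counts add up to the row count of the arg max subset
n_argmax <- nrow(toy[toy[, .I[g == max(g)], by = id]$V1])
sum(ties$n_at_max) == n_argmax
```

Here `tied_ids` contains only id "a", and the difference between `sum(ties$n_at_max)` and `nrow(ties)` is the number of additional rows the arg max subset has compared to the aggregate.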