Alexander - 1 month ago 5x
R Question

# Conditional row summing inside of group

I have some trouble with summing of rows in my df, I want to sum

`BAD`
and
`UGLY`
rows inside of each group if they exist!

``````group = c(seq(1,1.4,0.2),rep(seq(1.6,2,0.2),c(3,3,2)))
n=c(rep(1000,3),96,180,715,190,184,26,124,874)
df <- data.frame(group,clas,n)

> df
#   group clas    n
#5    1.6 GOOD  180
#6    1.6 UGLY  715
#8    1.8 GOOD  184
#9    1.8 UGLY   26
#11   2.0 GOOD  874
``````

I tried this but of course it didnt work because of my poor logic!

``````library(dplyr)
df %>%
group_by(group) %>%
``````

Error: invalid 'envir' argument of type 'logical'

The expected output if the code can be fixed,

``````> df
#   group clas    n  sum
#5    1.6 GOOD  180    811
#6    1.6 UGLY  715    811
#8    1.8 GOOD  184    216
#9    1.8 UGLY   26    216
#11   2.0 GOOD  874    124
``````

thanks!

We can use `data.table`. Convert the 'data.frame' to 'data.table' (`setDT(df)`), grouped by 'group', get a logical index based on the 'BAD', 'UGLY' elements in 'clas' using `%in%`, subset the 'n' with this index, get the `sum` and assign (`:=`) it to a new column 'Sum'.

``````library(data.table)
setDT(df)[,  Sum := sum(n[as.character(clas) %chin% c("BAD", "UGLY")]), by = group]
df
#   group clas    n  Sum
#5:   1.6 GOOD  180  811
#6:   1.6 UGLY  715  811
#8:   1.8 GOOD  184  216
#9:   1.8 UGLY   26  216
#11:   2.0 GOOD  874  124
``````

### Benchmarks

On a 1e6 dataset, the benchmarks are

``````set.seed(24)
df1 <- data.frame(group = sample(1:1000, 1e6, replace=TRUE),
clas = sample(c("BAD", "GOOD", "UGLY"), 1e6, replace=TRUE),
n = sample(100:1000, 1e6, replace=TRUE))

df2 <- copy(df1)

system.time(setDT(df1)[,  Sum := sum(n[as.character(clas) %chin% c("BAD", "UGLY")]), by = group])
#    user  system elapsed
#   0.04    0.02    0.06

system.time(merge(df2, aggregate(n~group, df2[!df2\$clas == "GOOD", ], sum), by = "group",
suffixes = c("", "Sum")))
#   user  system elapsed
#   5.00    0.16    5.17
``````

If we increase the number of rows from 1e6 to 1e7

``````system.time(setDT(df1)[,  Sum := sum(n[as.character(clas) %chin% c("BAD", "UGLY")]), by = group])
#   user  system elapsed
#   0.65    0.00    0.66
system.time(merge(df2, aggregate(n~group, df2[!df2\$clas == "GOOD", ], sum), by = "group",
suffixes = c("", "Sum")))
#   user  system elapsed
#  58.00    1.58   59.78
``````
Source (Stackoverflow)