Alexander Alexander - 3 months ago 8
R Question

Conditional row summing inside of group

I have some trouble with summing of rows in my df, I want to sum

BAD
and
UGLY
rows inside of each group if they exist!

group = c(seq(1,1.4,0.2),rep(seq(1.6,2,0.2),c(3,3,2)))
clas=c(rep("BAD",3),rep(c("BAD","GOOD","UGLY"),2),rep(c("BAD","GOOD"),1))
n=c(rep(1000,3),96,180,715,190,184,26,124,874)
df <- data.frame(group,clas,n)

> df
# group clas n
#1 1.0 BAD 1000
#2 1.2 BAD 1000
#3 1.4 BAD 1000
#4 1.6 BAD 96
#5 1.6 GOOD 180
#6 1.6 UGLY 715
#7 1.8 BAD 190
#8 1.8 GOOD 184
#9 1.8 UGLY 26
#10 2.0 BAD 124
#11 2.0 GOOD 874


I tried this but of course it didnt work because of my poor logic!

library(dplyr)
df %>%
group_by(group) %>%
mutate(sum = ifelse(all(clas=="BAD"),n,ifelse(with(clas=="BAD"&clas=="UGLY"),n["BAD"]+n["UGLY"],"NA")))



Error: invalid 'envir' argument of type 'logical'


The expected output if the code can be fixed,

> df
# group clas n sum
#1 1.0 BAD 1000 1000
#2 1.2 BAD 1000 1000
#3 1.4 BAD 1000 1000
#4 1.6 BAD 96 811
#5 1.6 GOOD 180 811
#6 1.6 UGLY 715 811
#7 1.8 BAD 190 216
#8 1.8 GOOD 184 216
#9 1.8 UGLY 26 216
#10 2.0 BAD 124 124
#11 2.0 GOOD 874 124


thanks!

Answer

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'group', get a logical index based on the 'BAD', 'UGLY' elements in 'clas' using %in%, subset the 'n' with this index, get the sum and assign (:=) it to a new column 'Sum'.

library(data.table)
setDT(df)[,  Sum := sum(n[as.character(clas) %chin% c("BAD", "UGLY")]), by = group]
df
#   group clas    n  Sum
#1:   1.0  BAD 1000 1000
#2:   1.2  BAD 1000 1000
#3:   1.4  BAD 1000 1000
#4:   1.6  BAD   96  811
#5:   1.6 GOOD  180  811
#6:   1.6 UGLY  715  811
#7:   1.8  BAD  190  216
#8:   1.8 GOOD  184  216
#9:   1.8 UGLY   26  216
#10:   2.0  BAD  124  124
#11:   2.0 GOOD  874  124

Benchmarks

On a 1e6 dataset, the benchmarks are

set.seed(24)
df1 <- data.frame(group = sample(1:1000, 1e6, replace=TRUE), 
         clas = sample(c("BAD", "GOOD", "UGLY"), 1e6, replace=TRUE),
         n = sample(100:1000, 1e6, replace=TRUE))


df2 <- copy(df1)

system.time(setDT(df1)[,  Sum := sum(n[as.character(clas) %chin% c("BAD", "UGLY")]), by = group])
#    user  system elapsed 
#   0.04    0.02    0.06 


system.time(merge(df2, aggregate(n~group, df2[!df2$clas == "GOOD", ], sum), by = "group", 
                                                suffixes = c("", "Sum")))
#   user  system elapsed 
#   5.00    0.16    5.17 

If we increase the number of rows from 1e6 to 1e7

system.time(setDT(df1)[,  Sum := sum(n[as.character(clas) %chin% c("BAD", "UGLY")]), by = group])
#   user  system elapsed 
#   0.65    0.00    0.66 
system.time(merge(df2, aggregate(n~group, df2[!df2$clas == "GOOD", ], sum), by = "group", 
                                                suffixes = c("", "Sum")))
#   user  system elapsed 
#  58.00    1.58   59.78