user299791 user299791 - 1 month ago 7
R Question

how to aggregate data frame and add 0 for categories not found

I have a data frame like:

> prova
sent weeknumber processed
1 100 1 1
2 23 1 0
3 254 1 1
4 321 2 0
5 1241 2 0
6 323 2 1
7 1221 3 1

structure(list(sent = c(100, 23, 254, 321, 1241, 323, 1221),
weeknumber = c(1, 1, 1, 2, 2, 2, 3), processed = c(1, 0,
1, 0, 0, 1, 1)), .Names = c("sent", "weeknumber", "processed"
), row.names = c(NA, -7L), class = "data.frame")


If I want to extract the number of Sent by week number for rows with processed = 0, I can do:

aggregate(prova[prova$processed==0,]$sent, by=list(prova[prova$processed==0,]$weeknumber), FUN = sum)
Group.1 x
1 1 23
2 2 1562


And if I want to extract the sum of Sent by week number when processed = 1 I do:

aggregate(prova[prova$processed==1,]$sent, by=list(prova[prova$processed==1,]$weeknumber), FUN = sum)
Group.1 x
1 1 354
2 2 323
3 3 1221


However, I'd like to find a way to have always the same result length, i.e. in case of processed=0, something like this:

Group.1 x
1 1 23
2 2 1562
3 3 0 // this is the new row I'd like to add


If I simply pass the whole list of possible week numbers, I get:

aggregate(prova[prova$processed==0,]$sent, by=list(prova$weeknumber), FUN = sum)
Error in aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length


Any hint/advice much appreciated!

Answer

We can use an if/else condition with data.table. Convert the 'data.frame' to 'data.table' (setDT(prova)), grouped by 'weeknumber', if there are not any 0 values in 'processed', return 0 or else get the sum of 'sent' where 'processed' is 0.

library(data.table)
setDT(prova)[, .(sent = if(!any(processed==0)) 0 
               else sum(sent[processed==0])), by = weeknumber]
#   weeknumber sent
#1:          1   23
#2:          2 1562
#3:          3    0

But, if we need the sum of 'sent' for each value of 'processed' grouped by 'weeknumber', a convenient option is dcast

dcast(setDT(prova), weeknumber~processed, value.var="sent", sum)
#  weeknumber    0    1
#1:          1   23  354
#2:          2 1562  323
#3:          3    0 1221

Or with xtabs from base R which also does the sum of 'sent' for each of the combinations of 'weeknumber' with 'processed'.

xtabs(sent~weeknumber + processed, prova)

If we are using aggregate, one option is merge the output of aggregate with the unique set of 'weeknumber' and then replace the NA elements in 'sent' to 0.

res <- merge(data.frame(weeknumber = unique(prova$weeknumber)), 
      aggregate(sent~weeknumber, prova, subset = processed ==0, FUN = sum),
     all.x=TRUE)
res$sent[is.na(res$sent)] <- 0
res
#   weeknumber sent
#1          1   23
#2          2 1562
#3          3    0
Comments