ptenax ptenax - 13 days ago 4
R Question

aggregate over rows for many columns

I have a large dataframe in the same format as

d
below, with count data for three columns nested within two different factors.

elevation distance sp1 sp2 spn
1500 0 2 2 5
1500 0 2 1 5
1500 50 2 2 5
1500 50 2 2 6
2000 0 9 2 5
2000 0 7 2 2
2000 50 4 3 6
2000 50 4 3 4


Notice that there are two replicate rows for each level of factor
d$distance
.

I would like to aggregate those replicate rows for each distance level within each elevation by summing in each column, so it ends up like this:

elevation distance sp1 sp2 spn
1500 0 4 3 10
1500 50 4 4 11
2000 0 16 4 7
2000 50 8 6 10


I can do it easily for one column, eg
sp1
.

d2 <-data.frame(aggregate(sp1 ~ elevation + distance, data = d, sum))


Can I avoid a
for
loop to get a new dataframe in the same format that includes all columns
sp
,
sp2
,
spn
? Trying to adapt various other solutions I have seen online have ended in failure because certain bits of my own brain are missing. Thanks.

Answer

We can use summarise_each from dplyr. It would be fast and efficient.

library(dplyr)
df1 %>%
   group_by(elevation, distance) %>% 
   summarise_each(funs(sum))
#  elevation distance   sp1   sp2   spn
#      <int>    <int> <int> <int> <int>
#1      1500        0     4     3    10
#2      1500       50     4     4    11
#3      2000        0    16     4     7
#4      2000       50     8     6    10

Or another option is data.table

library(data.table)
setDT(df1)[, lapply(.SD, sum) , by = .(elevation, distance)]

A base R approach would be with aggregate would be to use . to specify all the columns except the ones specified in the rhs of ~. But, on a large dataset, this would be slow.

aggregate(.~elevation+distance, df1, sum)
#   elevation distance sp1 sp2 spn
#1      1500        0   4   3  10
#2      2000        0  16   4   7
#3      1500       50   4   4  11
#4      2000       50   8   6  10

NOTE: If there are NA values, use na.rm = TRUE from the sum.

As @user2100721 suggested, we can also use by from base R

by(df1[3:5], df1[1:2], FUN = colSums)

The output will be a list and it can be converted to matrix by rbinding the list elements.

Comments