Alex Spedding Alex Spedding - 1 month ago 9
R Question

R Given a list of same dimension data tables, produce a summary of the means of each cell

I'm finding it hard to put what I want into words so I will try to run through an example to explain it. Let's say I've repeated an experiment twice and have two tables:

[df1] [df2]
X Y X Y
2 3 4 1
5 2 2 4


These tables are stored in a list (where the list can contain more than two elements if necessary), and what I want to do is create an average of each cell in the tables across the list (or for a generalised version, apply any function I choose to the cells i.e. mad, sd, etc)

[df1] [df2] [dfMeans]
X Y X Y X Y
2 3 4 1 mean(2,4) mean(3,1)
5 2 2 4 mean(5,2) mean(2,4)


I have a code solution to my problem, but since this is in R there is most likely a cleaner way to do things:

df1 <- data.frame(X=c(2,3,4),Y=c(3,2,1))
df2 <- data.frame(X=c(5,1,3),Y=c(4,1,4))
df3 <- data.frame(X=c(2,7,4),Y=c(1,7,6))
dfList <- list(df1,df2,df3)

dfMeans <- data.frame(MeanX=c(NA,NA,NA),MeanY=c(NA,NA,NA))

for (rowIndex in 1:nrow(df1)) {
for (colIndex in 1:ncol(df1)) {
valuesAtCell <- c()
for (tableIndex in 1:length(dfList)) {
valuesAtCell <- c(valuesAtCell, dfList[[tableIndex]][rowIndex,colIndex])
}
dfMeans[rowIndex, colIndex] <- mean(valuesAtCell)
}
}

print(dfMeans)

Answer

Here is a data.table solution where the mean is applied row-wise across the data frames:

library(data.table)

dtList <- rbindlist(dfList, use.names = TRUE, idcol = TRUE)
dtList
   .id X Y
1:   1 2 3
2:   1 3 2
3:   1 4 1
4:   2 5 4
5:   2 1 1
6:   2 3 4
7:   3 2 1
8:   3 7 7
9:   3 4 6

dtList[, rn := 1:.N, by = .id][][, .(X = mean(X), Y = mean(Y)), by = rn]
   rn        X        Y
1:  1 3.000000 2.666667
2:  2 3.666667 3.333333
3:  3 3.666667 3.666667

You can replace the mean by another aggregation function, eg, median. The .id column numbers the original data frames each row was sourced from.

Edit

The solution can be extended to an arbitrary number of columns (provided column names and column order are identical in all data frames):

cn <- colnames(df1)
cn
[1] "X" "Y"

dtList[, rn := 1:.N, by = .id][, lapply(.SD, mean), by = rn, .SDcols = cn][, rn := NULL][]
          X        Y
1: 3.000000 2.666667
2: 3.666667 3.333333
3: 3.666667 3.666667

The column names are taken from one of the original data frames which adds to the flexibility of the solution. [, rn := NULL] removes the row numbers from the result, [] ensures the result ist printed.