pyg - 1 year ago 114
R Question

# Multiple statistics of multiple columns by factor(s)

Suppose I'd like to calculate the mean, standard deviation, and n (number of non-NA values) for columns "dat_1" to "dat_3" of the following dataframe, grouped by the factors "fac_1" and "fac_2", such that separate dataframes for each statistic (or function) can be accessed from the result

``````set.seed(1)
df <- data.frame("fac_1" = c(rep("a", 5), rep("b", 4)),
"fac_2" = c("x", "x", "y","y", "y", "y", "x", "x", "x"),
"dat_1" = c(floor(runif(3, 0, 10)), NA, floor(runif(5, 0, 10))),
"dat_2" = floor(runif(9, 10, 20)),
"dat_3" = floor(runif(9, 20, 30)))
``````

This can be achieved one function at a time using
`plyr`
, as such

``````ddply(.data = df, .variables = .(df\$fac_1, df\$fac_2), .fun = function(x) { colMeans(x[, 3:5], na.rm = T) } ) # mean
ddply(.data = df, .variables = .(df\$fac_1, df\$fac_2), .fun = function(x) { psych::SD(x[, 3:5], na.rm = T) } ) # standrd deviation -- note uses SD from the 'psych' package
ddply(.data = df, .variables = .(df\$fac_1, df\$fac_2), .fun = function(x) { colSums(!is.na(x[, 3:5])) } ) # number of non-NA values
``````

but this becomes cumbersome when using multiple functions, especially when factors and columns of interest must be changed. I'm wondering if there's an alternative (a one-liner, perhaps).

Aggregate works

``````aggregate( x = df[, c(3:5)], by = df[, c(1,2)], FUN = function(x) c(n = length( !is.na(x) ), mean = mean(x, na.rm = T), sd = sd(x, na.rm = T) ) )
``````

but 'disaggregating' the result (into separate dataframes for each statistic) becomes awkward.

Recently I've come across
`dplyr`
. The following seems to work

``````df %>% group_by(fac_1, fac_2) %>% summarise_each(funs(n = length( !is.na(.) ), mean(., na.rm = TRUE), sd(., na.rm = TRUE) )) # using dplyr
``````

however I'd like to be able to paste factors into
`group_by()`

Any help or ideas? Thanks

Passing vectors or lists to dplyr functions can be tricky (see this vignette.) In short, it involves adding an additional underscore, to use the standard evaluation version of a function, and then passing a vector or list to the `.dots` argument.

``````factorsToSummarise <-
c('fac_1', 'fac_2')

# extra underscore
# |
df %>%  # v
group_by_(.dots = factorsToSummarise) %>%
summarise_each(funs(n = length( !is.na(.) ),
mean(., na.rm = TRUE),
sd(., na.rm = TRUE)
)) # using dplyr
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download