pyg pyg - 3 months ago 10
R Question

Multiple statistics of multiple columns by factor(s)

Suppose I'd like to calculate the mean, standard deviation, and n (number of non-NA values) for columns "dat_1" to "dat_3" of the following dataframe, grouped by the factors "fac_1" and "fac_2", such that separate dataframes for each statistic (or function) can be accessed from the result

set.seed(1)
df <- data.frame("fac_1" = c(rep("a", 5), rep("b", 4)),
"fac_2" = c("x", "x", "y","y", "y", "y", "x", "x", "x"),
"dat_1" = c(floor(runif(3, 0, 10)), NA, floor(runif(5, 0, 10))),
"dat_2" = floor(runif(9, 10, 20)),
"dat_3" = floor(runif(9, 20, 30)))


This can be achieved one function at a time using
plyr
, as such

ddply(.data = df, .variables = .(df$fac_1, df$fac_2), .fun = function(x) { colMeans(x[, 3:5], na.rm = T) } ) # mean
ddply(.data = df, .variables = .(df$fac_1, df$fac_2), .fun = function(x) { psych::SD(x[, 3:5], na.rm = T) } ) # standrd deviation -- note uses SD from the 'psych' package
ddply(.data = df, .variables = .(df$fac_1, df$fac_2), .fun = function(x) { colSums(!is.na(x[, 3:5])) } ) # number of non-NA values


but this becomes cumbersome when using multiple functions, especially when factors and columns of interest must be changed. I'm wondering if there's an alternative (a one-liner, perhaps).

Aggregate works

aggregate( x = df[, c(3:5)], by = df[, c(1,2)], FUN = function(x) c(n = length( !is.na(x) ), mean = mean(x, na.rm = T), sd = sd(x, na.rm = T) ) )


but 'disaggregating' the result (into separate dataframes for each statistic) becomes awkward.

Recently I've come across
dplyr
. The following seems to work

df %>% group_by(fac_1, fac_2) %>% summarise_each(funs(n = length( !is.na(.) ), mean(., na.rm = TRUE), sd(., na.rm = TRUE) )) # using dplyr


however I'd like to be able to paste factors into
group_by()
, and I've not found a way to do so.

Any help or ideas? Thanks

Answer

Passing vectors or lists to dplyr functions can be tricky (see this vignette.) In short, it involves adding an additional underscore, to use the standard evaluation version of a function, and then passing a vector or list to the .dots argument.

factorsToSummarise <-
  c('fac_1', 'fac_2')

   # extra underscore
        # |
df %>%  # v
  group_by_(.dots = factorsToSummarise) %>% 
  summarise_each(funs(n = length( !is.na(.) ), 
                      mean(., na.rm = TRUE), 
                      sd(., na.rm = TRUE) 
  )) # using dplyr