pyg - 3 months ago 10

R Question

Suppose I'd like to calculate the mean, standard deviation, and *n* (number of non-NA values) for columns "dat_1" to "dat_3" of the following dataframe, grouped by the factors "fac_1" and "fac_2", such that separate dataframes for each statistic (or function) can be accessed from the result:

```
set.seed(1)

df <- data.frame("fac_1" = c(rep("a", 5), rep("b", 4)),
                 "fac_2" = c("x", "x", "y", "y", "y", "y", "x", "x", "x"),
                 "dat_1" = c(floor(runif(3, 0, 10)), NA, floor(runif(5, 0, 10))),
                 "dat_2" = floor(runif(9, 10, 20)),
                 "dat_3" = floor(runif(9, 20, 30)))
```

This can be achieved one function at a time using `plyr`:

```
library(plyr)

# mean
ddply(.data = df, .variables = .(fac_1, fac_2),
      .fun = function(x) { colMeans(x[, 3:5], na.rm = TRUE) })

# standard deviation (note: uses SD from the 'psych' package)
ddply(.data = df, .variables = .(fac_1, fac_2),
      .fun = function(x) { psych::SD(x[, 3:5], na.rm = TRUE) })

# number of non-NA values
ddply(.data = df, .variables = .(fac_1, fac_2),
      .fun = function(x) { colSums(!is.na(x[, 3:5])) })
```

but this becomes cumbersome when using multiple functions, especially when factors and columns of interest must be changed. I'm wondering if there's an alternative (a one-liner, perhaps).

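`aggregate` also works (note `sum(!is.na(x))` rather than `length(!is.na(x))`, which would count the NA values too):

`aggregate( x = df[, c(3:5)], by = df[, c(1,2)], FUN = function(x) c(n = sum( !is.na(x) ), mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE) ) )`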

but 'disaggregating' the result (into separate dataframes for each statistic) becomes awkward.
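One way to sidestep the disaggregation, sketched here on the assumption that one `aggregate()` call per statistic is acceptable, is to loop over a named list of functions so that each statistic lands in its own dataframe. The names `group_cols`, `value_cols`, `stat_funs` and `stats` are illustrative only.

```r
set.seed(1)
df <- data.frame(
  fac_1 = c(rep("a", 5), rep("b", 4)),
  fac_2 = c("x", "x", "y", "y", "y", "y", "x", "x", "x"),
  dat_1 = c(floor(runif(3, 0, 10)), NA, floor(runif(5, 0, 10))),
  dat_2 = floor(runif(9, 10, 20)),
  dat_3 = floor(runif(9, 20, 30))
)

# Illustrative names (not from the original post): change these to
# switch grouping factors, data columns, or statistics.
group_cols <- c("fac_1", "fac_2")
value_cols <- c("dat_1", "dat_2", "dat_3")
stat_funs <- list(
  n    = function(x) sum(!is.na(x)),       # count of non-NA values
  mean = function(x) mean(x, na.rm = TRUE),
  sd   = function(x) sd(x, na.rm = TRUE)
)

# One aggregate() call per statistic; the result is a named list of dataframes.
stats <- lapply(stat_funs, function(f) {
  aggregate(df[value_cols], by = df[group_cols], FUN = f)
})

stats$mean  # dataframe of group means
stats$n     # dataframe of non-NA counts
```

Each element of `stats` keeps the grouping columns, so the dataframes can be inspected or merged independently.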

Recently I've come across `dplyr`:

`df %>% group_by(fac_1, fac_2) %>% summarise_each(funs(n = sum( !is.na(.) ), mean(., na.rm = TRUE), sd(., na.rm = TRUE) )) # using dplyr`

However, I'd like to be able to pass the factor names into `group_by()` programmatically (for example, as a character vector).

Any help or ideas? Thanks

Answer

Passing vectors or lists to dplyr functions can be tricky (see this vignette). In short, you use the standard-evaluation version of the function, whose name ends in an extra underscore, and pass the vector or list to its `.dots` argument.

```
library(dplyr)

factorsToSummarise <- c('fac_1', 'fac_2')

df %>%
  group_by_(.dots = factorsToSummarise) %>%  # note the extra underscore: group_by_()
  summarise_each(funs(n = sum(!is.na(.)),    # sum(), not length(), so NAs are not counted
                      mean(., na.rm = TRUE),
                      sd(., na.rm = TRUE)
  )) # using dplyr
```
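The underscore verbs and `summarise_each()`/`funs()` were deprecated in later dplyr releases. A rough equivalent, assuming dplyr 1.0.0 or newer is installed, uses `across()` with `all_of()`. The variable name `result` is illustrative, and the data setup is repeated only to keep the sketch self-contained.

```r
library(dplyr)  # assumes dplyr >= 1.0.0

set.seed(1)
df <- data.frame(
  fac_1 = c(rep("a", 5), rep("b", 4)),
  fac_2 = c("x", "x", "y", "y", "y", "y", "x", "x", "x"),
  dat_1 = c(floor(runif(3, 0, 10)), NA, floor(runif(5, 0, 10))),
  dat_2 = floor(runif(9, 10, 20)),
  dat_3 = floor(runif(9, 20, 30))
)

factorsToSummarise <- c("fac_1", "fac_2")

result <- df %>%
  # all_of() selects the columns named in the character vector
  group_by(across(all_of(factorsToSummarise))) %>%
  summarise(
    # everything() here covers all non-grouping columns
    across(everything(), list(
      n    = ~ sum(!is.na(.x)),
      mean = ~ mean(.x, na.rm = TRUE),
      sd   = ~ sd(.x, na.rm = TRUE)
    )),
    .groups = "drop"
  )

result
```

The output columns are named `<column>_<statistic>` (for example `dat_1_mean`), so a single statistic can be pulled out afterwards with something like `select(result, all_of(factorsToSummarise), ends_with("_mean"))`.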