pyg - 1 year ago 114

R Question

Suppose I'd like to calculate the mean, standard deviation, and *n* (number of non-NA values) for columns "dat_1" to "dat_3" of the following dataframe, grouped by the factors "fac_1" and "fac_2", such that separate dataframes for each statistic (or function) can be accessed from the result:

    set.seed(1)

    df <- data.frame("fac_1" = c(rep("a", 5), rep("b", 4)),
                     "fac_2" = c("x", "x", "y", "y", "y", "y", "x", "x", "x"),
                     "dat_1" = c(floor(runif(3, 0, 10)), NA, floor(runif(5, 0, 10))),
                     "dat_2" = floor(runif(9, 10, 20)),
                     "dat_3" = floor(runif(9, 20, 30)))

This can be achieved one function at a time using `plyr`:

`ddply(.data = df, .variables = .(fac_1, fac_2), .fun = function(x) { colMeans(x[, 3:5], na.rm = TRUE) } ) # mean`

`ddply(.data = df, .variables = .(fac_1, fac_2), .fun = function(x) { psych::SD(x[, 3:5], na.rm = TRUE) } ) # standard deviation; note that this uses SD from the 'psych' package`

`ddply(.data = df, .variables = .(fac_1, fac_2), .fun = function(x) { colSums(!is.na(x[, 3:5])) } ) # number of non-NA values`

but this becomes cumbersome when using multiple functions, especially when factors and columns of interest must be changed. I'm wondering if there's an alternative (a one-liner, perhaps).

`aggregate()` works

`aggregate( x = df[, c(3:5)], by = df[, c(1,2)], FUN = function(x) c(n = sum( !is.na(x) ), mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE) ) )`

but 'disaggregating' the result (into separate dataframes for each statistic) becomes awkward.

Recently I've come across `dplyr`:

`df %>% group_by(fac_1, fac_2) %>% summarise_each(funs(n = sum( !is.na(.) ), mean(., na.rm = TRUE), sd(., na.rm = TRUE) )) # using dplyr`

However, I'd like to be able to pass the factor names (for example, as a character vector) into `group_by()`.

Any help or ideas? Thanks


Answer

Passing vectors or lists to dplyr functions can be tricky (see this vignette). In short, it involves adding an extra underscore to the function name, which selects the standard-evaluation version of the function, and then passing a vector or list to its `.dots` argument.

```
factorsToSummarise <- c('fac_1', 'fac_2')

#   extra underscore
#         |
df %>%  # v
  group_by_(.dots = factorsToSummarise) %>%
  summarise_each(funs(n = sum(!is.na(.)),   # sum(), not length(), so NAs are not counted
                      mean(., na.rm = TRUE),
                      sd(., na.rm = TRUE)))  # using dplyr
```
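Later dplyr releases deprecated `group_by_()`, `summarise_each()` and `funs()`. The following is a minimal sketch of the same idea, assuming dplyr 1.0.0 or newer, where `across()` and `all_of()` are available. It also returns one data frame per statistic, which is what the question asked for. The names `stats` and `summaries` are illustrative and not from the original post.

```r
library(dplyr)

# Example data from the question
set.seed(1)
df <- data.frame(
  fac_1 = c(rep("a", 5), rep("b", 4)),
  fac_2 = c("x", "x", "y", "y", "y", "y", "x", "x", "x"),
  dat_1 = c(floor(runif(3, 0, 10)), NA, floor(runif(5, 0, 10))),
  dat_2 = floor(runif(9, 10, 20)),
  dat_3 = floor(runif(9, 20, 30))
)

factorsToSummarise <- c("fac_1", "fac_2")

# One summary function per statistic (hypothetical helper list);
# the element names become the names of the result list.
stats <- list(
  n    = function(x) sum(!is.na(x)),
  mean = function(x) mean(x, na.rm = TRUE),
  sd   = function(x) sd(x, na.rm = TRUE)
)

# Apply each function to every non-grouping column,
# producing one summarised data frame per statistic.
summaries <- lapply(stats, function(f) {
  df %>%
    group_by(across(all_of(factorsToSummarise))) %>%
    summarise(across(everything(), f), .groups = "drop")
})

summaries$mean  # group means of dat_1 to dat_3
summaries$n     # counts of non-NA values per group
```

Changing the grouping factors or the statistics then only requires editing `factorsToSummarise` or `stats`. Note that a group with a single row yields `NA` for the standard deviation.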
