Jaywalker - 6 months ago 32
R Question

This problem has me stumped.

I have the following data frame:

``````
library(dplyr)

# approximation of data frame
x <- data.frame(doy = sample(seq(200, 300), 20, replace = TRUE),
                year = sample(c("2000", "2005"), 20, replace = TRUE),
                phase = sample(c("pre", "post"), 20, replace = TRUE))
``````

and a simple 'summarize' function that takes in the column name as a variable, and works nicely:

``````
getStats <- function(df, col) {
  col <- as.name(col)
  df %>%
    group_by(year, phase) %>%
    summarize(n = sum(!is.na(col)),
              mean = mean(col, na.rm = TRUE),
              sd = sd(col, na.rm = TRUE),
              se = sd/sqrt(n))
}

> getStats(x, "doy")
Source: local data frame [4 x 6]
Groups: year [?]

year  phase     n    mean       sd       se
<fctr> <fctr> <int>   <dbl>    <dbl>    <dbl>
1   2000   post     8 248.625 30.42526 10.75695
2   2000    pre     2 290.000 14.14214 10.00000
3   2005   post     5 231.400 32.86031 14.69558
4   2005    pre     5 274.200 29.79429 13.32441
``````

However, if I modify the function to get the median, it returns an error:

``````
getStats <- function(df, col) {
  col <- as.name(col)
  df %>%
    group_by(year, phase) %>%
    summarize(n = sum(!is.na(col)),
              mean = mean(col, na.rm = TRUE),
              med = median(col, na.rm = TRUE), # new line
              sd = sd(col, na.rm = TRUE),
              se = sd/sqrt(n))
}

> getStats(x, "doy")

``````

I've tried a host of name and position changes, but all yield the same result: 'median' doesn't accept the column name as a passed variable. I assume I'm missing something so basic I'll do a face palm when someone points it out to me, but in the interim I feel like I'm losing my sanity. I appreciate any insights!

Your proximate problem may be that `median` doesn't have a `...` argument (it only gained one in R 3.4.0), while `mean` does. I'm not sure why `sd` is working; maybe an interaction between methods and `...`?
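One way to check the `...` hypothesis is to inspect each function's formal arguments directly. The snippet below is only a minimal sketch: `has_dots` is a hypothetical helper written for this illustration, not part of base R or `dplyr`, and the result for `median` depends on the R version you have installed.

```r
# Hypothetical helper: does a function accept a `...` argument?
has_dots <- function(f) "..." %in% names(formals(f))

has_dots(mean)           # mean(x, ...) is a generic with a `...` argument
has_dots(stats::median)  # depends on the installed R version
has_dots(stats::sd)      # sd(x, na.rm = FALSE) has no `...` argument
```

Since `sd` has no `...` either, the missing `...` alone probably does not explain why `median` fails while `sd` works.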

In any case, IMO the right way to handle this sort of problem is to use standard evaluation rather than non-standard evaluation, i.e. `summarise_` rather than `summarise`, as illustrated in `vignette("nse", package = "dplyr")`.

The example below shows how this works in the global environment rather than inside a function, but I think that shouldn't matter:

``````
col <- "doy"
funs <- c("n","mean","stats::median","sd","se")
## put together function calls
dots <- c(sprintf("sum(!is.na(%s))",col),
          sprintf("%s(%s,na.rm=TRUE)",funs[2:4],col),
          "sd/sqrt(n)")
names(dots) <- gsub("^.*::","",funs)  ## ugh
dots
##                               n                            mean
##              "sum(!is.na(doy))"          "mean(doy,na.rm=TRUE)"
##                          median                              sd
## "stats::median(doy,na.rm=TRUE)"            "sd(doy,na.rm=TRUE)"
##                              se
##                    "sd/sqrt(n)"

x %>%
  group_by(year, phase) %>%
  summarise_(.dots=dots)
``````

The only annoying thing here is that for some reason `dplyr` can't find `median` unless I call it as `stats::median`, which means we have to work a little harder to get nice column names. The standard-evaluation method is a little uglier, but that's the price you pay for this kind of flexibility.
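To make the string-building step easier to follow in isolation, here is a small base-R sketch. `build_dots` and the `toy` data frame are hypothetical names introduced only for this illustration; the sketch assumes nothing beyond base R and shows that each element of `dots` is just ordinary R code stored as text, with the `pkg::` prefix stripped from the names.

```r
# Hypothetical helper (not part of dplyr): build the named vector of
# expression strings for a single column name.
build_dots <- function(col) {
  funs <- c("n", "mean", "stats::median", "sd", "se")
  dots <- c(sprintf("sum(!is.na(%s))", col),
            sprintf("%s(%s, na.rm = TRUE)", funs[2:4], col),
            "sd/sqrt(n)")
  # Strip any "pkg::" prefix so the column is named "median",
  # not "stats::median"
  names(dots) <- gsub("^.*::", "", funs)
  dots
}

dots <- build_dots("doy")
names(dots)

# Each string can be parsed and evaluated against a data frame in base R
toy <- data.frame(doy = c(200, 250, NA, 300))
eval(parse(text = dots[["median"]]), envir = toy)
```

This mirrors the idea behind passing strings to `summarise_`: the expressions are built as data first and evaluated against the data frame afterwards.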

Embedding this in a function, I would probably split things at a different point and leave the grouping outside `getStats`, i.e.

``````
getStats <- function(data,col) {
  col <- deparse(substitute(col))
  funs <- c("n","mean","stats::median","sd","se")
  dots <- c(sprintf("sum(!is.na(%s))",col),
            sprintf("%s(%s,na.rm=TRUE)",funs[2:4],col),
            "sd/sqrt(n)")
  names(dots) <- gsub("^.*::","",funs)  ## ugh
  summarise_(data,.dots=dots)
}

x %>% group_by(year,phase) %>% getStats(doy)
``````

This gives you more flexibility to do different groupings ...
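Note that the underscore verbs such as `summarise_` have been deprecated since dplyr 0.7.0. As a rough sketch of how the same idea might look in a more recent dplyr, the column name can be passed as a string and looked up through the `.data` pronoun. The function name `get_stats` and the `toy` data frame are made up for this illustration, and the code assumes a dplyr version that supports `.data`.

```r
library(dplyr)

# Sketch only: summarise one column, given its name as a string.
# Grouping is left to the caller, as in the answer above.
get_stats <- function(data, col) {
  data %>%
    summarise(
      n      = sum(!is.na(.data[[col]])),
      mean   = mean(.data[[col]], na.rm = TRUE),
      median = median(.data[[col]], na.rm = TRUE),
      sd     = sd(.data[[col]], na.rm = TRUE),
      se     = sd / sqrt(n)  # uses the sd and n columns computed just above
    )
}

# Made-up data for illustration
toy <- data.frame(phase = c("pre", "pre", "post", "post"),
                  doy   = c(210, 230, 250, NA))

toy %>% group_by(phase) %>% get_stats("doy")
```

With the data frame from the question, the equivalent call would be `x %>% group_by(year, phase) %>% get_stats("doy")`.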

Source (Stackoverflow)