Jaywalker - 13 days ago
R Question

dplyr 'object not found' median only

This problem has me stumped.

I have the following data frame:

library(dplyr)

# approximation of data frame
x <- data.frame(doy = sample(200:300, 20, replace = TRUE),
                year = sample(c("2000", "2005"), 20, replace = TRUE),
                phase = sample(c("pre", "post"), 20, replace = TRUE))


and a simple 'summarize' function that takes in the column name as a variable, and works nicely:

getStats <- function(df, col) {
    col <- as.name(col)
    df %>%
        group_by(year, phase) %>%
        summarize(n = sum(!is.na(col)),
                  mean = mean(col, na.rm = TRUE),
                  sd = sd(col, na.rm = TRUE),
                  se = sd/sqrt(n))
}

> getStats(x, "doy")
Source: local data frame [4 x 6]
Groups: year [?]

    year  phase     n    mean       sd       se
  <fctr> <fctr> <int>   <dbl>    <dbl>    <dbl>
1   2000   post     8 248.625 30.42526 10.75695
2   2000    pre     2 290.000 14.14214 10.00000
3   2005   post     5 231.400 32.86031 14.69558
4   2005    pre     5 274.200 29.79429 13.32441


However, if I modify the function to get the median, it returns an error:

getStats <- function(df, col) {
    col <- as.name(col)
    df %>%
        group_by(year, phase) %>%
        summarize(n = sum(!is.na(col)),
                  mean = mean(col, na.rm = TRUE),
                  med = median(col, na.rm = TRUE), # new line
                  sd = sd(col, na.rm = TRUE),
                  se = sd/sqrt(n))
}

> getStats(x, "doy")

Error in median (doy, na.rm = TRUE): object "doy" not found


I've tried a host of name and position changes, but all yield the same result: 'median' doesn't accept the column name as a passed variable. I assume I'm missing something so basic I'll do a face palm when someone points it out to me, but in the interim I feel like I'm losing my sanity. I appreciate any insights!

Answer

Your proximate problem may be that median doesn't have a ... argument, while mean does. (I'm not sure why sd works; maybe an interaction between methods and ...?)
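
One way to check the guess about formal arguments is to inspect the functions directly. This is only a sketch in base R; the variable names (mean_args and so on) are made up for illustration, and what you see for median depends on your R version, so no output is shown here.

```r
# Formal argument names of the three summary functions used in getStats.
mean_args   <- names(formals(mean))
median_args <- names(formals(median))
sd_args     <- names(formals(sd))

print(mean_args)
print(median_args)
print(sd_args)

# Does each function accept a ... argument in this R installation?
sapply(list(mean = mean_args, median = median_args, sd = sd_args),
       function(a) "..." %in% a)
```

If median turns out to have a ... argument in your installation, the explanation above would not apply there, which is one more reason to treat it as a guess.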

In any case, IMO the right way to handle this sort of problem is to use standard evaluation rather than non-standard evaluation, i.e. use summarise_ rather than summarise, as illustrated in vignette("nse", package = "dplyr").

I'm illustrating how this works in the global environment rather than inside a function, but I think that shouldn't matter:

col <- "doy"
funs <- c("n","mean","stats::median","sd","se")
## put together function calls
dots <- c(sprintf("sum(!is.na(%s))",col),
      sprintf("%s(%s,na.rm=TRUE)",funs[2:4],col),
      "sd/sqrt(n)")
names(dots) <- gsub("^.*::","",funs)  ## ugh
dots
##                              n                            mean 
##              "sum(!is.na(doy))"          "mean(doy,na.rm=TRUE)" 
##                        median                              sd 
## "stats::median(doy,na.rm=TRUE)"            "sd(doy,na.rm=TRUE)" 
##                              se 
##                    "sd/sqrt(n)" 

x %>% 
    group_by(year, phase) %>% 
    summarise_(.dots=dots)

The only annoying thing here is that for some reason dplyr can't find median unless I call it as stats::median, which means we have to work a little harder to get nice column names. The standard-evaluation method is a little uglier, but that's the price you pay for this kind of flexibility.
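
To see what the strings in dots amount to, here is a rough base R analogy: each string is parsed into an expression and evaluated with the data frame's columns visible as variables. This is not a claim about how dplyr handles .dots internally; the toy data frame and the name res are made up for the illustration.

```r
# Toy data: one NA so that na.rm = TRUE matters.
x <- data.frame(doy = c(210, 250, NA, 290))

col <- "doy"
dots <- c(n      = sprintf("sum(!is.na(%s))", col),
          median = sprintf("stats::median(%s, na.rm = TRUE)", col))

# Parse each string and evaluate it with the columns of x in scope.
res <- sapply(dots, function(s) eval(parse(text = s), envir = x))
res
# n = 3, median = 250
```

Because the column name is pasted into the string before anything is evaluated, nothing ever has to look up a variable called col inside the data, which is the lookup that failed in the question.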

Embedding this in a function, I would probably break off getStats in a different place, i.e.

getStats <- function(data, col) {
    col <- deparse(substitute(col))
    funs <- c("n","mean","stats::median","sd","se")
    dots <- c(sprintf("sum(!is.na(%s))",col),
          sprintf("%s(%s,na.rm=TRUE)",funs[2:4],col),
          "sd/sqrt(n)")
    names(dots) <- gsub("^.*::","",funs)  ## ugh
    summarise_(data,.dots=dots)
}

x %>% group_by(year,phase) %>% getStats(doy)

This gives you more flexibility to do different groupings ...
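
For readers on a newer dplyr, where the underscored verbs such as summarise_ are deprecated, the same "group outside, summarise inside" design can be sketched with the embrace operator. This assumes dplyr 1.0 or later; get_stats2 is a hypothetical name, and the seed is arbitrary.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only to make the example reproducible
x <- data.frame(doy = sample(200:300, 20, replace = TRUE),
                year = sample(c("2000", "2005"), 20, replace = TRUE),
                phase = sample(c("pre", "post"), 20, replace = TRUE))

# Hypothetical tidy-eval counterpart of getStats: takes (possibly grouped)
# data and an unquoted column name, forwarded with {{ }}.
get_stats2 <- function(data, col) {
    summarise(data,
              n      = sum(!is.na({{ col }})),
              mean   = mean({{ col }}, na.rm = TRUE),
              median = median({{ col }}, na.rm = TRUE),
              sd     = sd({{ col }}, na.rm = TRUE),
              se     = sd / sqrt(n))  # uses the sd and n columns computed above
}

# The caller picks the grouping, so the same function serves both cases.
by_year_phase <- x %>% group_by(year, phase) %>% get_stats2(doy)
by_year       <- x %>% group_by(year) %>% get_stats2(doy)
```

Since the grouping happens before the function is called, switching from year-and-phase to year-only summaries needs no change to get_stats2 itself.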