Jaywalker - 3 months ago 22

R Question

This problem has me stumped.

I have the following data frame:

```
library(dplyr)

# approximation of data frame
x <- data.frame(doy = sample(c(seq(200, 300)), 20, replace = T),
                year = sample(c("2000", "2005"), 20, replace = T),
                phase = sample(c("pre", "post"), 20, replace = T))
```

and a simple 'summarize' function that takes in the column name as a variable, and works nicely:

```
getStats <- function(df, col) {
    col <- as.name(col)
    df %>%
        group_by(year, phase) %>%
        summarize(n = sum(!is.na(col)),
                  mean = mean(col, na.rm = T),
                  sd = sd(col, na.rm = T),
                  se = sd/sqrt(n))
}

> getStats(x, "doy")
Source: local data frame [4 x 6]
Groups: year [?]

    year  phase     n    mean       sd       se
  <fctr> <fctr> <int>   <dbl>    <dbl>    <dbl>
1   2000   post     8 248.625 30.42526 10.75695
2   2000    pre     2 290.000 14.14214 10.00000
3   2005   post     5 231.400 32.86031 14.69558
4   2005    pre     5 274.200 29.79429 13.32441
```

However, if I modify the function to get the median, it returns an error:

```
getStats <- function(df, col) {
    col <- as.name(col)
    df %>%
        group_by(year, phase) %>%
        summarize(n = sum(!is.na(col)),
                  mean = mean(col, na.rm = T),
                  med = median(col, na.rm = T), # new line
                  sd = sd(col, na.rm = T),
                  se = sd/sqrt(n))
}

> getStats(x, "doy")
Error in median (doy, na.rm = TRUE): object "doy" not found
```

I've tried a host of name and position changes, but all yield the same result: 'median' doesn't accept the column name as a passed variable. I assume I'm missing something so basic I'll do a face palm when someone points it out to me, but in the interim I feel like I'm losing my sanity. I appreciate any insights!

Answer

Your proximal problem may be that `median` doesn't have a `...` argument, while `mean` does (I'm not sure why `sd` is working ... maybe an interaction between methods and `...`?)

In any case, IMO the right way to handle this sort of problem is to use *standard* evaluation rather than non-standard evaluation, i.e. use `summarise_` rather than `summarise`, as illustrated in `vignette("nse",package="dplyr")`:

Illustrating how this works in the global environment rather than inside a function, but I think that shouldn't matter ...

```
col <- "doy"
funs <- c("n","mean","stats::median","sd","se")
## put together function calls
dots <- c(sprintf("sum(!is.na(%s))",col),
          sprintf("%s(%s,na.rm=TRUE)",funs[2:4],col),
          "sd/sqrt(n)")
names(dots) <- gsub("^.*::","",funs) ## ugh
dots
##                               n                            mean
##              "sum(!is.na(doy))"          "mean(doy,na.rm=TRUE)"
##                          median                              sd
## "stats::median(doy,na.rm=TRUE)"            "sd(doy,na.rm=TRUE)"
##                              se
##                    "sd/sqrt(n)"
x %>%
    group_by(year, phase) %>%
    summarise_(.dots=dots)
```

The only annoying thing here is that for some reason `dplyr` can't find `median` unless I call it as `stats::median`, which means we have to work a little harder to get nice column names. The standard-evaluation method is a little uglier, but that's the price you pay for this kind of flexibility.
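To make the string-building step less opaque: the entries passed to `.dots` are just R code held as text. The base-R sketch below is only a rough analogy (it is not a claim about how `dplyr` evaluates them internally); it shows the name clean-up done by `gsub` and one of the generated calls being parsed and evaluated against a data frame. The names `toy`, `clean_names`, `call_string` and `med` are made up for illustration.

```r
# Toy data standing in for the question's data frame (values are made up).
toy <- data.frame(doy = c(210, 250, NA, 290))

col  <- "doy"
funs <- c("n", "mean", "stats::median", "sd", "se")

# Strip any "pkg::" prefix so the result columns get plain names.
clean_names <- gsub("^.*::", "", funs)
clean_names   # "n" "mean" "median" "sd" "se"

# Build one call as a string, exactly as sprintf() does in the answer.
call_string <- sprintf("%s(%s,na.rm=TRUE)", funs[3], col)
call_string   # "stats::median(doy,na.rm=TRUE)"

# Parse the text into an expression and evaluate it with the data frame's
# columns in scope; the NA is dropped, leaving 210, 250, 290.
med <- eval(parse(text = call_string), envir = toy)
med           # 250
```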

Embedding this in a function, I would probably break off `getStats` in a different place, i.e.

```
getStats <- function(data,col) {
    col <- deparse(substitute(col))
    funs <- c("n","mean","stats::median","sd","se")
    dots <- c(sprintf("sum(!is.na(%s))",col),
              sprintf("%s(%s,na.rm=TRUE)",funs[2:4],col),
              "sd/sqrt(n)")
    names(dots) <- gsub("^.*::","",funs) ## ugh
    summarise_(data,.dots=dots)
}
x %>% group_by(year,phase) %>% getStats(doy)
```

Keeping the grouping outside `getStats` gives you more flexibility to do different groupings ...
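For readers on later `dplyr` releases, where the underscore verbs such as `summarise_` are deprecated, here is a minimal sketch of the question's original interface (column name passed as a string) using the `.data` pronoun instead. It assumes a reasonably recent `dplyr`; `getStatsData` is a hypothetical name, and `set.seed(1)` plus the `TRUE` spellings are my additions so the example is reproducible.

```r
library(dplyr)

set.seed(1)  # assumption: added only so the sketch is reproducible
x <- data.frame(doy = sample(200:300, 20, replace = TRUE),
                year = sample(c("2000", "2005"), 20, replace = TRUE),
                phase = sample(c("pre", "post"), 20, replace = TRUE))

# Hypothetical variant of getStats(): `col` is a string, and .data[[col]]
# looks that column up inside each group.
getStatsData <- function(df, col) {
    df %>%
        group_by(year, phase) %>%
        summarise(n = sum(!is.na(.data[[col]])),
                  mean = mean(.data[[col]], na.rm = TRUE),
                  median = median(.data[[col]], na.rm = TRUE),
                  sd = sd(.data[[col]], na.rm = TRUE),
                  se = sd / sqrt(n))  # uses the sd and n columns just created
}

res <- getStatsData(x, "doy")
res
```

As in the answer's second version, the grouping could also be moved out of the function if different groupings are needed.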