Aaron - 1 year ago 114
R Question

# dplyr summarise() with special functions for edge cases

I would like to make a groupwise

`summarise()`
operation in dplyr, but where a different function applies if edge cases are met.

I have count data that looks like this. The concentration and the standard deviation are calculated like this:

``````library(dplyr)
testdata <- data_frame(sample = sort(rep(1:3, 4)),
volume = rep(c(1e-1, 1e-1, 1e-2, 1e-2), 3),
count = c(400, 400, 40, 40, 0, 0, 0, 0, 400, 400, 400, 400))

testdata %>%
group_by(sample) %>%
summarise(concentration = sum(count) / sum(volume),
sd   = sqrt(sum(count)))
``````

However, when making the calculation only counts with values between 25-250 are to be included. which I could achieve with:

``````testdata %>%
group_by(sample) %>%
filter((count >= 25) & (count <= 250)) %>%
summarise(concentration = sum(count) / sum(volume),
sd   = sqrt(sum(count)))
``````

But then samples 2 & 3 have no concentration.

The edge cases for each group might be calculated with something like:

``````if (all(count <= 25)){
summarise(concentration = 25 / min(volume),
sd = NA)
}
else if (all(count >= 250)){
summarise(concentration = 250 / max(volume),
sd = NA)
}
``````

Can such edge cases be integrated into the
`summarise()`
function?

I would ideally also like a flag to indicate an edge case which returns result = "OK" for all cases except edge cases that return:

``````if (all(count <= 25)){
summarise(concentration = 25 / min(volume),
sd = NA,
result = "LOW")
}
else if (all(count >= 250)){
summarise(concentration = 250 / max(volume),
sd = NA,
result = "HIGH")
}
``````

One way is to encode your logic within `summarise` using `ifelse`:

``````library(dplyr)
result <- testdata %>% group_by(sample) %>%
summarise(concentration = ifelse(all(count <= 25),
25 / min(volume),
ifelse(all(count >= 250),
250 / max(volume),
sum(count) / sum(volume))),
sd = ifelse(all(count <= 25),
NA,
ifelse(all(count >= 250),
NA,
sqrt(sum(count)))),
result = ifelse(all(count <= 25),
"LOW",
ifelse(all(count >= 250),
"HIGH",
"OK")))
print(result)
### A tibble: 3 x 4
##  sample concentration       sd result
##   <int>         <dbl>    <dbl>  <chr>
##1      1          4000 29.66479     OK
##2      2          2500       NA    LOW
##3      3          2500       NA   HIGH
``````

### Updated approach

Another approach, which is hopefully closer to what the OP asks, is to define a function:

``````summarise.func <- function(count, volume) {
if (all(count <= 25)) {
concentration <- 25 / min(volume)
sd <- NA
result <- "LOW"
} else if (all(count >= 250)) {
concentration <- 250 / max(volume)
sd <- NA
result <- "HIGH"
} else {
concentration <- sum(count) / sum(volume)
sd <- sqrt(sum(count))
result <- "OK"
}
data.frame(concentration=concentration, sd=sd, result=result, stringsAsFactors=FALSE)
}
``````

that handles both the regular case and the edge cases. The key is that this function return a `data.frame` containing the summarized results. Then, `summarise` will create a column that is a list containing these data frames that can then be `tidyr::unnest`ed:

``````library(dplyr)
library(tidyr)
result <- testdata %>% group_by(sample) %>%
summarise(csr=list(f(count, volume))) %>%
unnest(csr)
print(result)
### A tibble: 3 x 4
##  sample concentration       sd result
##   <int>         <dbl>    <dbl>  <chr>
##1      1          4000 29.66479     OK
##2      2          2500       NA    LOW
##3      3          2500       NA   HIGH
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download