Aaron Aaron - 2 months ago 18
R Question

dplyr summarise() with special functions for edge cases

I would like to make a groupwise

summarise()
operation in dplyr, but where a different function applies if edge cases are met.

I have count data that looks like this. The concentration and the standard deviation are calculated like this:

library(dplyr)
testdata <- data_frame(sample = sort(rep(1:3, 4)),
volume = rep(c(1e-1, 1e-1, 1e-2, 1e-2), 3),
count = c(400, 400, 40, 40, 0, 0, 0, 0, 400, 400, 400, 400))

testdata %>%
group_by(sample) %>%
summarise(concentration = sum(count) / sum(volume),
sd = sqrt(sum(count)))


However, when making the calculation only counts with values between 25-250 are to be included. which I could achieve with:

testdata %>%
group_by(sample) %>%
filter((count >= 25) & (count <= 250)) %>%
summarise(concentration = sum(count) / sum(volume),
sd = sqrt(sum(count)))


But then samples 2 & 3 have no concentration.

The edge cases for each group might be calculated with something like:

if (all(count <= 25)){
summarise(concentration = 25 / min(volume),
sd = NA)
}
else if (all(count >= 250)){
summarise(concentration = 250 / max(volume),
sd = NA)
}


Can such edge cases be integrated into the
summarise()
function?

I would ideally also like a flag to indicate an edge case which returns result = "OK" for all cases except edge cases that return:

if (all(count <= 25)){
summarise(concentration = 25 / min(volume),
sd = NA,
result = "LOW")
}
else if (all(count >= 250)){
summarise(concentration = 250 / max(volume),
sd = NA,
result = "HIGH")
}

Answer

One way is to encode your logic within summarise using ifelse:

library(dplyr)
result <- testdata %>% group_by(sample) %>%
                       summarise(concentration = ifelse(all(count <= 25), 
                                                        25 / min(volume),
                                                        ifelse(all(count >= 250), 
                                                               250 / max(volume), 
                                                               sum(count) / sum(volume))),
                                 sd = ifelse(all(count <= 25), 
                                             NA, 
                                             ifelse(all(count >= 250), 
                                                    NA, 
                                                    sqrt(sum(count)))),
                                 result = ifelse(all(count <= 25), 
                                                 "LOW", 
                                                 ifelse(all(count >= 250),
                                                        "HIGH", 
                                                        "OK")))
print(result)
### A tibble: 3 x 4
##  sample concentration       sd result
##   <int>         <dbl>    <dbl>  <chr>
##1      1          4000 29.66479     OK
##2      2          2500       NA    LOW
##3      3          2500       NA   HIGH

Updated approach

Another approach, which is hopefully closer to what the OP asks, is to define a function:

summarise.func <- function(count, volume) {
  if (all(count <= 25)) {
    concentration <- 25 / min(volume)
    sd <- NA
    result <- "LOW"
  } else if (all(count >= 250)) {
    concentration <- 250 / max(volume)
    sd <- NA
    result <- "HIGH"
  } else {
    concentration <- sum(count) / sum(volume)
    sd <- sqrt(sum(count))
    result <- "OK"
  }
  data.frame(concentration=concentration, sd=sd, result=result, stringsAsFactors=FALSE)
}

that handles both the regular case and the edge cases. The key is that this function return a data.frame containing the summarized results. Then, summarise will create a column that is a list containing these data frames that can then be tidyr::unnested:

library(dplyr)
library(tidyr)
result <- testdata %>% group_by(sample) %>%
                       summarise(csr=list(f(count, volume))) %>%
                       unnest(csr)
print(result)
### A tibble: 3 x 4
##  sample concentration       sd result
##   <int>         <dbl>    <dbl>  <chr>
##1      1          4000 29.66479     OK
##2      2          2500       NA    LOW
##3      3          2500       NA   HIGH