emehex - 1 month ago
R Question

Efficient resampling for the sum of specific values in a dataframe

My data looks like this:

df <- data.frame(
  x = c("dog", "dog", "dog", "cat", "cat", "fish", "fish", "fish", "squid", "squid", "squid"),
  y = c(10, 11, 6, 3, 4, 5, 5, 9, 14, 33, 16)
)


I want to iterate through the data and grab one value for each animal in some "inclusion/filter" list and then sum them together.

For instance, maybe I just care about dog, cat, and fish.

animals <- c("dog", "cat", "fish")


In resample 1, I could get 10, 4, 9 (sum = 23) and in resample 2 I could get 6, 3, 5 (sum = 14).

I just whipped up a really janky replicate/for function that leans on dplyr, but it seems super inefficient:

ani_samp <- function(animals){

  total <- 0
  for (i in animals) {

    v <- df %>%
      filter(x == i) %>%
      sample_n(1) %>%
      select(y) %>%
      as.numeric()

    total <- total + v
  }
  return(total)
}

replicate(1000,ani_samp(animals))


How might I improve this resampling/pseudo-bootstrap code?

Answer

I'm not sure if this is much better (I don't have time for benchmarks), but you could avoid the double loop here. You could first filter by animals (and hence work on a subset) and then draw all n samples from each group in a single call. If you like dplyr, here's a possible dplyr/tidyr version:

library(tidyr)
library(dplyr)

ani_samp <- function(animals, n){
  df %>%
    filter(x %in% animals) %>% # Work on a subset
    group_by(x) %>%
    sample_n(n, replace = TRUE) %>% # sample only once per each group
    group_by(x) %>%
    mutate(id = row_number()) %>% # Create an index for rowSums
    spread(x, y) %>% # Convert to wide format for rowSums
    mutate(res = rowSums(.[-1])) %>% # Sum everything at once
    .$res # You don't need this if you want a data.frame result instead
} 

set.seed(123) # For reproducible output
ani_samp(animals, 10)
# [1] 18 24 14 24 19 18 19 19 19 14
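
If you'd rather not depend on dplyr/tidyr, the same idea (subset once, draw all n values per group in one call, then sum across groups) can be sketched in base R. This is only a rough, unbenchmarked sketch; the name ani_samp_base and its data argument are made up for illustration and are not part of the code above.

```r
# Example data and filter list from the question
df <- data.frame(
  x = c("dog", "dog", "dog", "cat", "cat", "fish", "fish", "fish", "squid", "squid", "squid"),
  y = c(10, 11, 6, 3, 4, 5, 5, 9, 14, 33, 16)
)
animals <- c("dog", "cat", "fish")

# Hypothetical helper (an assumption, not from the answer above):
# returns n resampled sums, one value drawn per animal for each sum.
ani_samp_base <- function(data, animals, n) {
  keep <- data$x %in% animals # work on a subset
  # as.character() avoids empty groups if x happens to be a factor
  groups <- split(data$y[keep], as.character(data$x[keep]))
  # Draw n values per group by sampling indices, which stays correct
  # even when a group holds a single value
  draws <- vapply(
    groups,
    function(v) v[sample.int(length(v), n, replace = TRUE)],
    numeric(n)
  )
  draws <- matrix(draws, nrow = n) # keep an n-by-groups matrix even when n == 1
  rowSums(draws) # one sum per resample
}

set.seed(123) # for reproducible output
sums <- ani_samp_base(df, animals, 1000)
summary(sums)
```

Sampling indices with sample.int() rather than calling sample() on the values directly sidesteps the case where sample() treats a single number as the range 1:x. Animals in the filter list that do not occur in the data are silently skipped here.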