emehex - 1 year ago 66
R Question

# Efficient resampling for the sum of specific values in a dataframe

My data looks like this:

```
df <- data.frame(
  x = c("dog", "dog", "dog", "cat", "cat", "fish", "fish", "fish", "squid", "squid", "squid"),
  y = c(10, 11, 6, 3, 4, 5, 5, 9, 14, 33, 16)
)
```

I want to iterate through the data and grab one value for each animal in some "inclusion/filter" list and then sum them together.

For instance, maybe I just care about dog, cat, and fish.

```
animals <- c("dog", "cat", "fish")
```

In resample 1, I could get 10, 4, 9 (sum = 23) and in resample 2 I could get 6, 3, 5 (sum = 14).

I just whipped up a really janky replicate/for function that leans on `dplyr`, but it seems super inefficient:

```
library(dplyr)

ani_samp <- function(animals){
  total <- 0
  for (i in animals) {
    # Draw one random row for animal i and pull out its y value
    v <- df %>%
      filter(x == i) %>%
      sample_n(1) %>%
      select(y) %>%
      as.numeric()

    total <- total + v
  }
  return(total)
}

replicate(1000, ani_samp(animals))
```

How might I improve this resampling/pseudo-bootstrap code?

I'm not sure whether this is much better (I haven't had time to benchmark it), but you could avoid the double loop here. You could first filter by `animals` (and hence work on a subset) and then draw all `n` samples from each group in a single call. If you like `dplyr`, here's a possible `dplyr`/`tidyr` version:

```
library(tidyr)
library(dplyr)

ani_samp <- function(animals, n){
  df %>%
    filter(x %in% animals) %>%          # Work on a subset
    group_by(x) %>%
    sample_n(n, replace = TRUE) %>%     # Sample only once per group
    group_by(x) %>%
    mutate(id = row_number()) %>%       # Create an index for rowSums
    ungroup() %>%                       # Drop grouping before reshaping
    spread(x, y) %>%                    # Convert to wide format for rowSums
    mutate(res = rowSums(.[-1])) %>%    # Sum everything at once
    .$res                               # You don't need this if you want a data.frame result instead
}

set.seed(123) # For reproducible output
ani_samp(animals, 10)
# [1] 18 24 14 24 19 18 19 19 19 14
# (output as originally posted; exact values may differ across R/dplyr versions)
```
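
If you'd rather not depend on `dplyr`/`tidyr` at all, the same idea (subset first, then draw all `n` values per animal in one go and sum across animals) can be sketched in base R. This is only a rough sketch and I haven't benchmarked it either; `ani_samp_base` is a name I made up for illustration, and it takes `df` as an argument instead of reading it from the global environment.

```r
# Same toy data as in the question
df <- data.frame(
  x = c("dog", "dog", "dog", "cat", "cat", "fish", "fish", "fish", "squid", "squid", "squid"),
  y = c(10, 11, 6, 3, 4, 5, 5, 9, 14, 33, 16),
  stringsAsFactors = FALSE
)
animals <- c("dog", "cat", "fish")

# Hypothetical helper (not part of the original answer)
ani_samp_base <- function(df, animals, n) {
  keep <- df$x %in% animals                       # work on a subset
  groups <- split(df$y[keep], as.character(df$x[keep]))

  # Draw n values per animal by index; indexing avoids the sample()
  # pitfall where a length-one numeric vector v is treated as 1:v
  draws <- vapply(
    groups,
    function(v) v[sample.int(length(v), n, replace = TRUE)],
    numeric(n)
  )

  # One row per resample, one column per animal (also covers n == 1,
  # where vapply returns a plain vector rather than a matrix)
  rowSums(matrix(draws, nrow = n))
}

set.seed(123) # For reproducible output
res <- ani_samp_base(df, animals, 1000)
range(res)    # every sum lies between 14 (6 + 3 + 5) and 24 (11 + 4 + 9)
```

Each element of `res` is one resample, so `ani_samp_base(df, animals, 1000)` plays the same role as `replicate(1000, ani_samp(animals))` in the question.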