ajg0101 ajg0101 - 4 months ago 41
R Question

Nested foreach and dopar - bootstrapping each row of data frame

I have data frames that look similar to this:

maindata <- data.frame(cbind(num=c(79,61,62,57),
denom=c(162356,170189,164634,162006),
group=c(1,2,3,4)))


My intention is to select each row, perform bootstrap resampling, find quantiles for 95% confidence intervals, and output CIs to a data frame with 2 columns and the same number of rows as the original data frame. This function with nested foreach and %do% works pretty well but is slow with more iterations (e.g. 1000) and data frames with more rows:

boots = function(data, boots, seed=1234){
if (!missing(seed))
set.seed(seed)
pct <- NULL
ci.pct <- list()
foreach(j=1:nrow(data)) %do% {
datast1 <- c(rep(1, data[j,]$num),
rep(0, data[j,]$denom))
foreach(i=1:boots, .combine='c') %do% {
index <- sample(1:length(datast1), size=length(datast1), replace=TRUE)
sampledata <- datast1[index]
pct[i] <- mean(sampledata)
}
ci.pct[[j]] <- cbind(quantile(pct, prob=c(0.025))*100000,
quantile(pct, prob=c(0.975))*100000)
}
ci.pcts <- do.call("rbind", ci.pct)
return(ci.pcts)
}
boots(data=maindata, boots=5, seed=1234)


I have been trying to figure out a way to do this with %dopar% for parallel processing but can't quite grasp it:

bootsd = function(data, boots, seed=1234){
if (!missing(seed))
set.seed(seed)
pct <- NULL
ci.pct <- list()
foreach(j=1:nrow(data)) %do% {
datast1 <- c(rep(1, data[j,]$num),
rep(0, data[j,]$denom))
foreach(i=1:boots, .combine='c') %dopar% {
index <- sample(1:length(datast1), size=length(datast1), replace=TRUE)
sampledata <- datast1[index]
pct[i] <- mean(sampledata)
}
ci.pct[[j]] <- cbind(quantile(pct, prob=c(0.025))*100000,
quantile(pct, prob=c(0.975))*100000)
}
ci.pcts <- do.call("rbind", ci.pct)
return(ci.pcts)
}
bootsd(data=maindata, boots=5, seed=1234)


Does anyone have advice on how to modify the code to get it to run faster by correctly implementing %dopar% or some other neat trick?

Answer

I have slightly rewritten your function. I see the foreach as a function and it returns the result from the loop. Now it works with %dopar%. The only problem - it does not obey the seed. Different results are returned in each run. Probably you have to look at the doRNG package if this is necessary.

bootsd = function(data, boots, seed = 1234){
  if (!missing(seed)) set.seed(seed) 
  ci.pct <- foreach(j = 1:nrow(data)) %do% {
    datast1 <- c(rep(1, data[j, "num"]),
                 rep(0, data[j, "denom"]))
    pct <- foreach(i = 1:boots, .combine = 'c') %dopar% {
      index <- sample(1:length(datast1), size = length(datast1), replace = T)
      sampledata <- datast1[index]
      mean(sampledata)
    }
    cbind(quantile(pct, prob=c(0.025))*100000,
          quantile(pct, prob=c(0.975))*100000)
  }
  ci.pcts <- do.call("rbind", ci.pct)
  return(ci.pcts)
}

bootsd(data = maindata, boots = 5, seed = 1234)