ajg0101 - 1 year ago 84

R Question

I have data frames that look similar to this:

`maindata <- data.frame(cbind(num=c(79,61,62,57),`

denom=c(162356,170189,164634,162006),

group=c(1,2,3,4)))

My intention is to select each row, perform bootstrap resampling, find quantiles for 95% confidence intervals, and output CIs to a data frame with 2 columns and the same number of rows as the original data frame. This function with nested foreach and %do% works pretty well but is slow with more iterations (e.g. 1000) and data frames with more rows:

`boots = function(data, boots, seed=1234){`

if (!missing(seed))

set.seed(seed)

pct <- NULL

ci.pct <- list()

foreach(j=1:nrow(data)) %do% {

datast1 <- c(rep(1, data[j,]$num),

rep(0, data[j,]$denom))

foreach(i=1:boots, .combine='c') %do% {

index <- sample(1:length(datast1), size=length(datast1), replace=TRUE)

sampledata <- datast1[index]

pct[i] <- mean(sampledata)

}

ci.pct[[j]] <- cbind(quantile(pct, prob=c(0.025))*100000,

quantile(pct, prob=c(0.975))*100000)

}

ci.pcts <- do.call("rbind", ci.pct)

return(ci.pcts)

}

boots(data=maindata, boots=5, seed=1234)

I have been trying to figure out a way to do this with %dopar% for parallel processing but can't quite grasp it:

`bootsd = function(data, boots, seed=1234){`

if (!missing(seed))

set.seed(seed)

pct <- NULL

ci.pct <- list()

foreach(j=1:nrow(data)) %do% {

datast1 <- c(rep(1, data[j,]$num),

rep(0, data[j,]$denom))

foreach(i=1:boots, .combine='c') %dopar% {

index <- sample(1:length(datast1), size=length(datast1), replace=TRUE)

sampledata <- datast1[index]

pct[i] <- mean(sampledata)

}

ci.pct[[j]] <- cbind(quantile(pct, prob=c(0.025))*100000,

quantile(pct, prob=c(0.975))*100000)

}

ci.pcts <- do.call("rbind", ci.pct)

return(ci.pcts)

}

bootsd(data=maindata, boots=5, seed=1234)

Does anyone have advice on how to modify the code to get it to run faster by correctly implementing %dopar% or some other neat trick?

Answer Source

I have slightly rewritten your function. I see the `foreach`

as a function and it returns the result from the loop. Now it works with `%dopar%`

. The only problem - it does not obey the seed. Different results are returned in each run. Probably you have to look at the `doRNG`

package if this is necessary.

```
bootsd = function(data, boots, seed = 1234){
if (!missing(seed)) set.seed(seed)
ci.pct <- foreach(j = 1:nrow(data)) %do% {
datast1 <- c(rep(1, data[j, "num"]),
rep(0, data[j, "denom"]))
pct <- foreach(i = 1:boots, .combine = 'c') %dopar% {
index <- sample(1:length(datast1), size = length(datast1), replace = T)
sampledata <- datast1[index]
mean(sampledata)
}
cbind(quantile(pct, prob=c(0.025))*100000,
quantile(pct, prob=c(0.975))*100000)
}
ci.pcts <- do.call("rbind", ci.pct)
return(ci.pcts)
}
bootsd(data = maindata, boots = 5, seed = 1234)
```