Zach - 9 months ago 48

R Question

When I run 2 random forests in caret, I get the exact same results if I set a random seed:

```
library(caret)
library(doParallel)
set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] TRUE
```

However, if I register a parallel back-end to speed up the modeling, I get a different result each time I run the model:

```
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
stopCluster(cl)
> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01813729"
[2] "Component 3: Mean relative difference: 0.02271638"
```

Is there any way to fix this issue? One suggestion was to use the doRNG package, but `train` uses nested foreach loops, which doRNG does not currently support:

```
library(doRNG)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
registerDoRNG()
set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
set.seed(42)
> model1 <- train(Species~., iris, method='rf', trControl=myControl)
Error in list(e1 = list(args = seq(along = resampleIndex)(), argnames = "iter", :
  nested/conditional foreach loops are not supported yet.
See the package's vignette for a work around.
```

UPDATE:

I thought this problem could be solved using `doSNOW` and `clusterSetupRNG`, but the two models still differ:

```
set.seed(42)
library(caret)
library(doSNOW)
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))
clusterSetupRNG(cl, seed=rep(12345,6))
a <- clusterCall(cl, runif, 10000)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
clusterSetupRNG(cl, seed=rep(12345,6))
b <- clusterCall(cl, runif, 10000)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
all.equal(a, b)
[1] TRUE
all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01890339"
[2] "Component 3: Mean relative difference: 0.01656751"
stopCluster(cl)
```

What's special about foreach, and why doesn't it use the seeds I initiated on the cluster? Objects `a` and `b` are identical, so why aren't `model1` and `model2`?

Answer Source

One easy way to run fully reproducible models in parallel mode with the `caret` package is to pass the `seeds` argument to `trainControl`. The code below resolves the question above; see the `trainControl` help page for further details.

```
library(doParallel); library(caret)
# Create a list of seeds, with different seeds for each resampling iteration
set.seed(123)
# List length = (n_repeats * n_resamples) + 1
seeds <- vector(mode = "list", length = 11)
# Each of the first 10 elements holds one seed per candidate tuning value
# (3 candidate values of mtry are evaluated for rf here)
for(i in 1:10) seeds[[i]] <- sample.int(n=1000, 3)
# A single seed for the final model
seeds[[11]] <- sample.int(1000, 1)
# Control list
myControl <- trainControl(method='cv', seeds=seeds, index=createFolds(iris$Species))
# Run the models in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
model1 <- train(Species~., iris, method='rf', trControl=myControl)
model2 <- train(Species~., iris, method='rf', trControl=myControl)
stopCluster(cl)
# Compare
all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] TRUE
```
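
If the number of resamples or tuning candidates changes, the seed list has to change shape with it. Below is a minimal sketch of a helper that builds such a list in base R. The name `make_seeds` and its arguments are hypothetical (they are not part of `caret`), and it assumes the layout described on the `trainControl` help page: one integer vector per resampling iteration, with one seed per candidate model, plus a single integer for the final fit.

```r
# Hypothetical helper, not part of caret: build a `seeds` list for trainControl().
#   n_resamples:  total resampling iterations (folds * repeats)
#   n_candidates: number of tuning-parameter combinations evaluated per resample
#   seed:         seed used to generate the list itself
# Note: calling set.seed() here also resets the global RNG state of the session.
make_seeds <- function(n_resamples, n_candidates, seed = 123) {
  set.seed(seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  for (i in seq_len(n_resamples)) {
    # one seed per candidate model within resample i
    seeds[[i]] <- sample.int(n = 1000, size = n_candidates)
  }
  # a single seed for the final model
  seeds[[n_resamples + 1]] <- sample.int(n = 1000, size = 1)
  seeds
}

# 10-fold CV with 3 candidate values of mtry, as in the answer above
seeds <- make_seeds(n_resamples = 10, n_candidates = 3)

# The result would then be passed on in the usual way, e.g.
# myControl <- trainControl(method = 'cv', seeds = seeds,
#                           index = createFolds(iris$Species))
```

The 10 and 3 used here match the example above; for other models or a different `tuneLength`, `n_candidates` would need to match the number of tuning combinations `train` actually evaluates.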