shenglih - 22 days ago
R Question

Parallelizing heterogeneous tasks in R: foreach, doMC, doParallel

Here's what's been puzzling me:

When you use foreach to schedule a sequence of tasks that are homogeneous in content but heterogeneous in processing time (not known ex ante), in what order does foreach hand these embarrassingly parallel tasks to the workers?

For instance, I registered 4 workers:

library(doMC)
registerDoMC(cores=4)

and I have 10 tasks, of which the 4th and the 5th each turned out to take longer than all the others combined. The first batch is then obviously the 1st, 2nd, 3rd and 4th. When the 1st, 2nd and 3rd are done, how exactly does foreach assign the remaining tasks? Is it random (which is how it looks from my observation)? And what is good practice for speeding things up when some tasks turn out to take much longer than others?

I am sorry for not providing a concrete example; my actual project code is much more involved...

Any experience, guidance, or pointers would be very much appreciated!

Answer

The doMC package is a wrapper around mclapply, and by default mclapply preschedules tasks, which means it splits the tasks into groups, or chunks. The twist is that it preschedules those tasks round-robin. Thus, if you have 10 tasks and 4 workers, the tasks will be assigned as follows:

  • worker 1: tasks 1, 5, 9
  • worker 2: tasks 2, 6, 10
  • worker 3: tasks 3, 7
  • worker 4: tasks 4, 8
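The assignment above can be reproduced in a few lines of base R, which also makes it easy to see how uneven the per-worker load can get. This is only an illustrative sketch: the variable names and the task durations are made up, and it mimics the round-robin split rather than calling into mclapply itself.

```r
n_tasks <- 10
n_workers <- 4

# Worker index for each task when tasks are dealt out round-robin.
worker <- rep_len(seq_len(n_workers), n_tasks)

# Task numbers grouped by worker (names "1".."4" are the worker indices).
assignment <- split(seq_len(n_tasks), worker)
print(assignment)

# Hypothetical durations in seconds: tasks 4 and 5 dominate, as in the question.
durations <- c(1, 1, 1, 20, 20, 1, 1, 1, 1, 1)

# Total work queued on each worker; the largest value bounds the wall-clock time.
worker_load <- sapply(split(durations, worker), sum)
print(worker_load)
```

With these invented durations the two slow tasks happen to land on different workers (4 and 1), so the split works out tolerably. Had the slow tasks been 1 and 5, both would have been queued on worker 1 while the other workers sat idle.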

If you're lucky, this will give reasonable performance even if the tasks have very different lengths, but you can disable prescheduling in doMC as follows:

opts <- list(preschedule=FALSE)
results <- foreach(i=1:10, .options.multicore=opts) %dopar% {
    # ...
}

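This causes doMC to call mclapply with mc.preschedule=FALSE, so that a new task is started as soon as a worker finishes its previous one, which balances the load naturally. The trade-off is that a separate process is forked for each task, so this setting suits a modest number of long, uneven tasks better than a large number of short ones.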
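To see the difference without involving a real workload, here is a small timing sketch that calls mclapply from the parallel package directly, since that is what doMC wraps. The durations and the run_task helper are hypothetical stand-ins for real work, and the slow tasks are deliberately placed at positions 1 and 5 so that prescheduling queues both on the same worker. Note that mclapply relies on forking, so mc.cores greater than 1 is not supported on Windows.

```r
library(parallel)

# Hypothetical task durations in seconds. Tasks 1 and 5 are the slow ones;
# with 4 workers and round-robin prescheduling both end up on worker 1.
durations <- c(0.5, 0.05, 0.05, 0.05, 0.5, 0.05, 0.05, 0.05, 0.05, 0.05)

# Stand-in for real work: sleep for the task's duration, return its index.
run_task <- function(i) {
  Sys.sleep(durations[i])
  i
}

# Prescheduled: tasks are split into per-worker groups before anything runs.
t_static <- system.time(
  res_static <- mclapply(seq_along(durations), run_task,
                         mc.cores = 4, mc.preschedule = TRUE)
)[["elapsed"]]

# Not prescheduled: one process is forked per task, at most 4 at a time.
t_dynamic <- system.time(
  res_dynamic <- mclapply(seq_along(durations), run_task,
                          mc.cores = 4, mc.preschedule = FALSE)
)[["elapsed"]]

cat("preschedule = TRUE :", t_static, "s\n")
cat("preschedule = FALSE:", t_dynamic, "s\n")
```

Both calls return the results in task order, so switching the option should not change what downstream code sees. Only the elapsed time differs, and by how much depends on the machine and on how the slow tasks happen to be distributed.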