Suppose I have 8 cores on my machine. I have loaded a 2 GB dataset into RAM, and I want each of these workers to read only from that dataset. What I do:
#read a couple of rows from the dataset (those rows are sent as argument to the worker function)
#process these rows
# create some example data files on disk
dir <- "C:/Users/things_to_process/"
for (i in 1:800) {
  my.matrix <- matrix(runif(100), ncol = 10, nrow = 10)
  saveRDS(my.matrix, file.path(dir, paste0("matrix_", i, ".rds")))
}
worker.function <- function(files) {
  files.length <- length(files)
  partial.results <- vector('list', files.length)
  for (i in 1:files.length) {
    # instead of reading from a file like I am doing below, I would like to
    # read from a list that is already in RAM
    matrix <- readRDS(files[i])
    partial.results[[i]] <- sum(diag(matrix))
  }
  partial.results
}
library(parallel)

cl <- makeCluster(detectCores(), type = "PSOCK")
file_list <- list.files(path = dir, recursive = FALSE, full.names = TRUE)
part <- clusterSplit(cl, seq_along(file_list))
files.partitioned <- lapply(part, function(p) file_list[p])
results <- clusterApply(cl, files.partitioned, worker.function)
stopCluster(cl)
result <- Reduce('+', unlist(results))
TL;DR: This can work much better on Linux.
There are two problems here:
R is single-threaded and only knows parallelism at the process level.
Windows doesn't have a "fork" system call, unlike Linux.
If you are on Linux and use a parallelization backend that uses forking (e.g.,
parallel::makeForkCluster()), you may be able to use the dataset in the workers without reloading/copying it.
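As a sketch (assuming a Linux machine; `big_list` is a hypothetical name for the list already in RAM):

```r
library(parallel)

# a large object already loaded in the parent process
big_list <- lapply(1:800, function(i) matrix(runif(100), ncol = 10, nrow = 10))

# forked workers inherit big_list via copy-on-write: no reload, no clusterExport()
cl <- makeForkCluster(detectCores())
part <- clusterSplit(cl, seq_along(big_list))
results <- clusterApply(cl, part, function(idx) {
  # big_list is visible here because the worker was forked from the parent
  lapply(idx, function(i) sum(diag(big_list[[i]])))
})
stopCluster(cl)
result <- Reduce('+', unlist(results))
```

Note that `clusterExport()` is unnecessary here; a PSOCK cluster would need it (and would copy the 2 GB into every worker).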
Modern operating systems support multiple threads per process, all of which have access to the same data. All threads in a process must ensure that concurrent data access always leaves memory in a consistent state, even if multiple threads update the same location. This is usually done with locking mechanisms, but it is non-trivial to implement. Some parts of R (e.g., if I remember correctly, the memory allocator) are inherently single-threaded, and so all (interpreted) R code must be as well. The only way to work in parallel with R is to spawn multiple processes.
Each new process on Windows starts "empty" and must load its code and data from external storage. On the other hand, Linux has a "fork" system call, which allows creating a second process that starts with exactly the same memory contents (code and data).
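This is why, on Linux, fork-based helpers such as `parallel::mclapply()` let children read the parent's data directly. A minimal sketch (Linux/macOS only; `big_list` is again an assumed name for the in-RAM data):

```r
library(parallel)

big_list <- lapply(1:800, function(i) matrix(runif(100), ncol = 10, nrow = 10))

# each forked child starts with the parent's memory contents, so it can read
# big_list without any serialization or reloading from disk
traces <- mclapply(seq_along(big_list),
                   function(i) sum(diag(big_list[[i]])),
                   mc.cores = detectCores())
result <- Reduce('+', traces)
```

On Windows, `mclapply()` silently falls back to running sequentially, which is another symptom of the missing "fork".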