Imlerith - 2 months ago
R Question

parallel package in R: passing a large object by reference on Windows

Suppose I have 8 cores on my computer. I have loaded a 2 GB dataset into RAM, and I want each worker to only read from that dataset. Here is what I do:

worker.function(rowstoread, dataset)
#read a couple of rows from the dataset (those rows are sent as argument to the worker function)
#process these rows
#return results

I was wondering why this would incur a copy of the dataset at the level of each worker since my workers are only reading from the dataset. They are not modifying anything in the dataset.

Is there a fix for that, or is this inherent to R? Also, would this problem be alleviated if I used a Linux machine instead, or would a copy of the dataset still be made for each worker?

Here is a more detailed but very simplified example of how I use the parallel package (on Windows):

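#data generation
dir <- "C:/Users/things_to_process/"

for(i in 1:800) {
  my.matrix <- matrix(runif(100),ncol=10,nrow=10)
  #save each matrix to its own file so the workers can read it back;
  #the file name pattern is an assumption, the original line is missing
  saveRDS(my.matrix,file=paste0(dir,"matrix_",i,".rds"))
}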


#worker function
worker.function <- function(files) {
  files.length <- length(files)
  partial.results <- vector('list',files.length)

  for(i in seq_len(files.length)) {
    #instead of reading from a file like I am doing below, I would like to
    #read from a list that is already in RAM
    m <- readRDS(files[i])
    partial.results[[i]] <- sum(diag(m))
  }

  #return one number per worker so the master can add the results up
  sum(unlist(partial.results))
}


#master part
library(parallel)

cl <- makeCluster(detectCores(), type = "PSOCK")

file_list <- list.files(path=dir,recursive=FALSE,full.names=TRUE)

part <- clusterSplit(cl,seq_along(file_list))
files.partitioned <- lapply(part,function(p) file_list[p])

results <- clusterApply(cl,files.partitioned,worker.function)
stopCluster(cl)

result <- Reduce('+',results)


TL;DR: This can work much better on Linux.

There are two problems here:

  1. R is single-threaded and only knows parallelism at the process level.

  2. Windows doesn't have a "fork" system call, unlike Linux.

If you are on Linux and use a parallelization backend that uses forking (e.g., parallel::makeForkCluster()), you may be able to use the dataset in the workers without reloading/copying it.
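
As a rough sketch of the forking approach, assuming a Linux (or other Unix-like) machine: the `dataset` matrix, the chunk count, and the worker body below are hypothetical stand-ins for the real data and processing, not taken from the question. The dataset is created before the cluster is forked, so the workers find it in their inherited memory and it is never passed as an argument.

```r
library(parallel)

# Hypothetical stand-in for the large dataset that is already in RAM.
dataset <- matrix(runif(1e5), ncol = 10)

# The worker only receives row indices and looks up `dataset` in the
# memory it inherited from the parent process.
worker.function <- function(rows) {
  sum(dataset[rows, ])
}

# Split the row indices into 4 chunks (the chunk count is arbitrary).
row.chunks <- splitIndices(nrow(dataset), 4)

if (.Platform$OS.type == "unix") {
  # Forked workers start with the parent's memory contents.
  cl <- makeForkCluster(2)
  results <- clusterApply(cl, row.chunks, worker.function)
  stopCluster(cl)
} else {
  # Forking is not available on Windows; fall back to a serial run.
  results <- lapply(row.chunks, worker.function)
}

total <- Reduce(`+`, results)
```

Only the row indices and the small per-chunk results travel between the master and the workers here.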


Modern operating systems support multiple threads per process, all of which have access to the same data. All threads in a process must ensure that concurrent data access always leaves the memory in a consistent state, even if multiple threads update the same location. This is usually done with locking mechanisms, which are non-trivial to implement correctly. Some parts of R (e.g., if I remember correctly, the memory allocator) are inherently single-threaded, and so is all (interpreted) R code. The only way to run R code in parallel is to spawn multiple processes.


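Each new process on Windows starts "empty" and must load its code and data from external storage or receive them from the parent process. On the other hand, Linux has a "fork" system call, which allows creating a second process that starts with exactly the same memory contents (code and data). The operating system shares those memory pages between parent and child and copies a page only when one of them writes to it (copy-on-write), so workers that only read the dataset do not need their own copy.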
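
On Windows, where the workers start empty, one possible mitigation is to send the dataset to each worker once, up front, and then pass only row indices with every task. This is a minimal sketch, not a way to avoid the copy: each PSOCK worker still holds its own copy of the data. The `dataset` matrix and the chunk count are hypothetical stand-ins.

```r
library(parallel)

# Hypothetical stand-in for the large dataset that is already in RAM.
dataset <- matrix(runif(1e5), ncol = 10)

cl <- makeCluster(2, type = "PSOCK")

# Copy the dataset into each worker's global environment exactly once.
clusterExport(cl, "dataset", envir = environment())

# The worker receives only row indices; `dataset` is resolved on the worker.
worker.function <- function(rows) {
  sum(dataset[rows, ])
}

# Many small tasks, none of which re-sends the dataset.
row.chunks <- splitIndices(nrow(dataset), 8)
results <- clusterApply(cl, row.chunks, worker.function)
stopCluster(cl)

total <- Reduce(`+`, results)
```

With 8 workers and a 2 GB dataset this still needs roughly 16 GB of RAM in the workers alone, which is why the forking approach on Linux is attractive.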