Yang Yang - 19 days ago
R Question

How do I parallelize k-means in R?

I have a very large dataset (5000 * 100) and I want to use the kmeans function to find clusters. However, I do not know how to use the clusterApply function.

set.seed(88)
mydata=rnorm(5000*100)
mydata=matrix(data=mydata,nrow = 5000,ncol = 100)

parallel.a=function(i) {
kmeans(mydata,3,nstart = i,iter.max = 1000)
}

library(parallel)
cl.cores <- detectCores()-1
cl <- makeCluster(cl.cores)
clusterSetRNGStream(cl,iseed=1234)
fit.km = clusterApply(cl,x,fun=parallel.a(500))
stopCluster(cl)


The clusterApply function requires an 'x' argument, which I do not know how to set. Also, what is the difference between clusterApply, parSapply and parLapply? Thanks a lot.

Answer

Here's a way to use clusterApply to perform a parallel kmeans by parallelizing over the nstart argument (assuming it is greater than one):

library(parallel)
nw <- detectCores()
cl <- makeCluster(nw)
clusterSetRNGStream(cl, iseed=1234)
set.seed(88)
mydata <- matrix(rnorm(5000 * 100), nrow=5000, ncol=100)

# Parallelize over the "nstart" argument
nstart <- 100
# Create a vector of length "nw"; sum(nstartv) is at least nstart
# (ceiling() rounds up when nw does not divide nstart evenly)
nstartv <- rep(ceiling(nstart / nw), nw)
results <- clusterApply(cl, nstartv,
        function(n, x) kmeans(x, 3, nstart=n, iter.max=1000),
        mydata)
# Pick the best result
i <- sapply(results, function(result) result$tot.withinss)
result <- results[[which.min(i)]]
print(result$tot.withinss)

People typically export mydata to the workers, but this example passes it as an additional argument to clusterApply. That makes sense (since the number of tasks is equal to the number of workers), is slightly more efficient (since it effectively combines the export with the computation), and avoids creating a global variable on the cluster workers (which is a bit tidier). (Of course, exporting makes more sense if you plan to perform more computations on the workers with that data set.)
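If you do want the export-based version, a minimal sketch (reusing the cl and nstartv objects from above; the variable names are just for illustration) could look like this:

# Export mydata to the workers once, then refer to it as a global
# variable inside the worker function.
clusterExport(cl, "mydata")
results <- clusterApply(cl, nstartv,
        function(n) kmeans(mydata, 3, nstart=n, iter.max=1000))
result <- results[[which.min(sapply(results, function(r) r$tot.withinss))]]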

Note that you can use detectCores()-1 workers if you like, but benchmarking on my machine showed that this example runs significantly faster with detectCores() workers. I suggest that you benchmark it on your machine to see what works better for you.
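If you want to check that on your own machine, a rough timing sketch along these lines should work (the helper function and worker counts here are just examples, not a rigorous benchmark):

# Time the same workload with a given number of workers.
time_kmeans <- function(nw) {
    cl <- makeCluster(nw)
    on.exit(stopCluster(cl))
    clusterSetRNGStream(cl, iseed=1234)
    nstartv <- rep(ceiling(100 / nw), nw)
    system.time(
        clusterApply(cl, nstartv,
                function(n, x) kmeans(x, 3, nstart=n, iter.max=1000),
                mydata)
    )["elapsed"]
}
time_kmeans(detectCores())
time_kmeans(detectCores() - 1)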


As for the differences between the parallel functions: clusterApply is a parallel version of lapply that processes each value of x in a separate task. parLapply is a parallel version of lapply that splits x so that it sends only one task per cluster worker (which can be more efficient). parSapply calls parLapply but simplifies the result in the same way that sapply simplifies the result of calling lapply.
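A toy example may make the scheduling difference clearer (the squaring function here is just a stand-in for real work):

x <- 1:8
# clusterApply: one task per element of x, so 8 tasks here
clusterApply(cl, x, function(i) i^2)
# parLapply: x is split into about one chunk per worker, so each
# worker receives a single task containing several elements
parLapply(cl, x, function(i) i^2)
# parSapply: same scheduling as parLapply, but the result is
# simplified to a vector, like sapply
parSapply(cl, x, function(i) i^2)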

clusterApply makes sense for a parallel kmeans since you are manually splitting nstart such that it sends only one task per cluster worker, making parLapply unnecessary.
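For what it's worth, a parLapply version of the same call would look nearly identical; it just adds a layer of chunking that does nothing extra here:

# Equivalent sketch with parLapply; nstartv already has one element
# per worker, so the internal chunking has no effect.
results <- parLapply(cl, nstartv,
        function(n, x) kmeans(x, 3, nstart=n, iter.max=1000),
        mydata)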
