Yang Yang - 2 months ago 20

R Question

I have a very large dataset (5000*100) and I want to use the `kmeans` function in parallel with `clusterApply`:

```
set.seed(88)
mydata=rnorm(5000*100)
mydata=matrix(data=mydata,nrow = 5000,ncol = 100)
parallel.a=function(i) {
  kmeans(mydata,3,nstart = i,iter.max = 1000)
}
library(parallel)
cl.cores <- detectCores()-1
cl <- makeCluster(cl.cores)
clusterSetRNGStream(cl,iseed=1234)
fit.km = clusterApply(cl,x,fun=parallel.a(500))
stopCluster(cl)
```

How should `clusterApply` be called here, and what is the difference between `clusterApply`, `parSapply`, and `parLapply`?

Answer

Here's a way to use `clusterApply` to perform a parallel kmeans by parallelizing over the `nstart` argument (assuming it is greater than one):

```
library(parallel)
nw <- detectCores()
cl <- makeCluster(nw)
clusterSetRNGStream(cl, iseed=1234)
set.seed(88)
mydata <- matrix(rnorm(5000 * 100), nrow=5000, ncol=100)
# Parallelize over the "nstart" argument
nstart <- 100
# Create vector of length "nw" where sum(nstartv) >= nstart
# (ceiling() rounds up, so the total can slightly exceed nstart)
nstartv <- rep(ceiling(nstart / nw), nw)
results <- clusterApply(cl, nstartv,
                        function(n, x) kmeans(x, 3, nstart=n, iter.max=1000),
                        mydata)
# Pick the best result
i <- sapply(results, function(result) result$tot.withinss)
result <- results[[which.min(i)]]
print(result$tot.withinss)
# Shut down the workers when done
stopCluster(cl)
```

People typically export `mydata` to the workers, but this example passes it as an additional argument to `clusterApply`. That makes sense (since the number of tasks is equal to the number of workers), is slightly more efficient (since it effectively combines the export with the computation), and avoids creating a global variable on the cluster workers (which is a bit more tidy). (Of course, exporting makes more sense if you plan to perform more computations on the workers with that data set.)
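For comparison, the export-based alternative mentioned above might look like the following. This is only a minimal sketch: the 2 workers and the small 500 x 10 matrix are assumptions chosen to keep it quick, and `col_means` is just an illustrative follow-up computation.

```r
library(parallel)

# Assumption: 2 workers and a small matrix, purely to keep the sketch quick
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 1234)
set.seed(88)
mydata <- matrix(rnorm(500 * 10), nrow = 500, ncol = 10)

# Copy mydata into the global environment of every worker, once.
# envir = environment() looks the variable up where this code runs.
clusterExport(cl, "mydata", envir = environment())

# The worker function no longer needs the data passed as an argument
nstartv <- rep(5, 2)
results <- clusterApply(cl, nstartv,
                        function(n) kmeans(mydata, 3, nstart = n, iter.max = 1000))

# The exported copy is still available for a later computation
col_means <- clusterEvalQ(cl, colMeans(mydata))

stopCluster(cl)
```

The export pays off in the last step: a second computation reuses the data already sitting on the workers instead of sending it again.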

Note that you can use `detectCores()-1` workers if you like, but benchmarking on my machine shows that it performs significantly faster with `detectCores()` workers. I suggest that you benchmark it on your machine to see what works better for you.
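One way to run such a benchmark is sketched below with `system.time`. The helper names `time_kmeans` and `run_kmeans`, the reduced data size, the `nstart` of 24, and the cap of 4 cores are all assumptions made to keep the sketch light; remove the cap and use your real data for a meaningful comparison.

```r
library(parallel)

set.seed(88)
# Assumption: a smaller matrix than the original so the comparison finishes quickly
mydata <- matrix(rnorm(1000 * 20), nrow = 1000, ncol = 20)
nstart <- 24

# Worker function (hypothetical name), defined at top level
run_kmeans <- function(n, x) kmeans(x, 3, nstart = n, iter.max = 1000)

# Hypothetical helper: elapsed seconds for a parallel kmeans on nw workers
time_kmeans <- function(nw) {
  cl <- makeCluster(nw)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, iseed = 1234)
  nstartv <- rep(ceiling(nstart / nw), nw)
  system.time(clusterApply(cl, nstartv, run_kmeans, mydata))[["elapsed"]]
}

# detectCores() can return NA on some platforms, so fall back to 2
ncores <- detectCores()
if (is.na(ncores)) ncores <- 2
ncores <- min(ncores, 4)  # assumption: cap to keep the sketch light

# Compare detectCores()-1 workers against detectCores() workers
candidates <- unique(pmax(1, c(ncores - 1, ncores)))
timings <- sapply(candidates, time_kmeans)
names(timings) <- candidates
print(timings)
```

A single timing per setting is noisy, so repeating each measurement a few times gives a more trustworthy picture.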

As for the difference between the different parallel functions, `clusterApply` is a parallel version of `lapply` that processes each value of `x` in a separate task. `parLapply` is a parallel version of `lapply` that splits `x` such that it sends only one task per cluster worker (which can be more efficient). `parSapply` calls `parLapply` but simplifies the result in the same way that `sapply` simplifies the result of calling `lapply`.
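The three functions can be compared side by side on a toy input. This is a minimal sketch: the 2 workers, the vector `x`, and the `square` function are assumptions for illustration only.

```r
library(parallel)

cl <- makeCluster(2)  # assumption: 2 workers for illustration
x <- 1:6
square <- function(v) v^2

# clusterApply: one task per element of x (6 tasks), returns a list
a <- clusterApply(cl, x, square)

# parLapply: x is split into one chunk per worker (2 tasks), returns a list
b <- parLapply(cl, x, square)

# parSapply: like parLapply, but simplifies the result to a vector
s <- parSapply(cl, x, square)

stopCluster(cl)

str(a)
str(b)
str(s)
```

The values are the same in all three cases; what differs is how the work is divided into tasks and the shape of the returned object.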

`clusterApply` makes sense for a parallel kmeans since you are manually splitting `nstart` such that it sends only one task per cluster worker, making `parLapply` unnecessary.
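The manual split in the answer rounds up with `ceiling`, so the workers may together run a few more starts than requested. If an exact total matters, one possible variant is sketched below. `split_nstart` is a hypothetical helper, not part of the `parallel` package, and the 8 workers are an assumption.

```r
# Hypothetical helper: split nstart into at most nw positive integers
# that sum exactly to nstart
split_nstart <- function(nstart, nw) {
  base <- nstart %/% nw     # starts every worker gets
  extra <- nstart %% nw     # leftover starts to hand out one by one
  nstartv <- rep(base, nw)
  nstartv[seq_len(extra)] <- base + 1
  nstartv[nstartv > 0]      # drop idle workers when nstart < nw
}

nstartv <- split_nstart(100, 8)  # assumption: 8 workers
print(nstartv)
print(sum(nstartv))
```

The resulting vector can be passed to `clusterApply` in place of the `rep(ceiling(nstart / nw), nw)` vector.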