Vespasian - 9 months ago 109

R Question

currently I'm using the build in function dist to calculate my distance matrix in R.

`dist(featureVector,method="manhattan")`

This is currently the bottlneck of the application and therefore the idea was to parallize this task(conceptually this should be possible)

Searching google and this forum did not succeed.

Does anybody has an idea?

Answer

Here's the structure for one route you could go. It is not faster than just using the `dist()`

function, instead taking many times longer. It does process in parallel, but even if the computation time were reduced to zero, the time to start up the function and export the variables to the cluster would probably be longer than just using `dist()`

```
library(parallel)
vec.array <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)
TaxiDistFun <- function(one.vec, whole.matrix) {
diff.matrix <- t(t(whole.matrix) - one.vec)
this.row <- apply(diff.matrix, 1, function(x) sum(abs(x)))
return(this.row)
}
cl <- makeCluster(detectCores())
clusterExport(cl, list("vec.array", "TaxiDistFun"))
system.time(dist.array <- parRapply(cl, vec.array,
function(x) TaxiDistFun(x, vec.array)))
stopCluster(cl)
dim(dist.array) <- c(2000, 2000)
```

Source (Stackoverflow)