pachamaltese - 19 days ago 11
R Question

# Which is the best way to parallelize cosine distance?

My R session crashes after the timeout is exceeded when I try to compute the cosine distance for a large dataset (~600,000 rows).

For small datasets my code works. Here is an example:

``````
library(data.table)
library(lsa)
relevant.data <- cbind(mtcars$wt)
rownames(relevant.data) <- rownames(mtcars)
cosine(t(relevant.data))
``````

I've read some posts on this website about parallelizing the cosine function, but had no luck.

Does a very efficient method exist?

Would you suggest Rcpp, as in the post "Parallel cosine distance using clusterapply in R"?

If computing something like a correlation matrix is inefficient, what do you suggest?

Coding it in `Rcpp` might buy you enough speed that you don't need the extra hassle of parallelizing. An example is below, but I don't know how it will do on your system or with a real-sized problem: a numeric vector of length 1e8 (equivalent to a 10,000 by 10,000 matrix) takes 763 MiB, so even storing the results for a problem 60^2 = 3600 times larger (roughly 2.6 TiB, i.e. about 2.9 TB) might be difficult.
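The storage estimate above can be checked with a few lines of R. This is only a back-of-the-envelope sketch: `dense_bytes` is a hypothetical helper name, and it counts 8 bytes per double while ignoring R's small per-object overhead.

```r
# Bytes needed for a dense n x n matrix of doubles (8 bytes per entry).
# `dense_bytes` is an illustrative helper, not part of any package.
dense_bytes <- function(n) 8 * as.numeric(n)^2

small <- dense_bytes(1e4)  # 10,000 x 10,000 result
large <- dense_bytes(6e5)  # 600,000 x 600,000 result

small / 2^20   # size of the small case in MiB
large / 2^40   # size of the large case in TiB
large / 1e12   # size of the large case in TB
large / small  # ratio between the two cases, (6e5 / 1e4)^2
```

Whatever the exact unit, a dense result for ~600,000 items is in the terabyte range, so the full matrix will not fit in RAM on a typical machine.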

``````
x <- as.matrix(mtcars)
library(lsa)
``````

Function from `lsa`:

``````
cosine(as.matrix(mtcars))
``````

Slightly stripped-down R code:

``````
cosR <- function(x) {
  co <- array(0, c(ncol(x), ncol(x)))
  ## f <- colnames(x)
  ## dimnames(co) <- list(f, f)
  for (i in 2:ncol(x)) {
    for (j in 1:(i - 1)) {
      co[i,j] <- crossprod(x[,i], x[,j])/
        sqrt(crossprod(x[,i]) * crossprod(x[,j]))
    }
  }
  co <- co + t(co)
  diag(co) <- 1
  return(as.matrix(co))
}
``````

Rcpp version, slightly modified from here:

``````
library(Rcpp)
## the code uses Armadillo types, so the RcppArmadillo package must be installed
cppFunction(depends = "RcppArmadillo",
            code = "NumericMatrix cosCpp(NumericMatrix Xr) {
  int n = Xr.nrow(), k = Xr.ncol();
  arma::mat X(Xr.begin(), n, k, false); // reuses memory and avoids extra copy
  arma::mat Y = arma::trans(X) * X; // matrix product
  arma::mat res = Y / (arma::sqrt(arma::diagvec(Y)) * arma::trans(arma::sqrt(arma::diagvec(Y))));
  return Rcpp::wrap(res);
}")
``````

Test equality:

``````
identical(cosR(x),unname(cosine(x)))
all.equal(cosCpp(x),cosR(x))

library(microbenchmark)
microbenchmark(cosine(x),cosR(x),cosCpp(x))
## Unit: nanoseconds
##       expr    min      lq       mean  median      uq      max neval cld
##  cosine(x) 460046 1181837 2069604.51 1530719 2528021  8757989   100   b
##    cosR(x) 542414 1096448 1915011.12 1331277 2321596 11740233   100   b
##  cosCpp(x)      7   12472   35827.76   17999   30556   644551   100  a
``````

The Rcpp version is about 1331277/17999 = 74 times faster (comparing its median time with that of `cosR`), and might (?) get you around memory issues as well.
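If the full result is too large to hold in memory, one possible workaround is to compute it one block of columns at a time and reduce or write out each block straight away. The sketch below assumes that is acceptable for your use case. `cos_blocks`, `block_size` and `FUN` are hypothetical names, and the `lapply` call could in principle be swapped for `parallel::mclapply` on a Unix-like system, which I have not benchmarked here.

```r
# Block-wise cosine similarity between the columns of x.
# Columns are scaled to unit length once; each slab is then a plain cross-product.
# `cos_blocks`, `block_size` and `FUN` are illustrative names.
# Assumes no column of x is all zeros (that would produce NaN).
cos_blocks <- function(x, block_size = 1000, FUN = identity) {
  xn <- sweep(x, 2, sqrt(colSums(x^2)), "/")  # unit-length columns
  starts <- seq(1, ncol(xn), by = block_size)
  lapply(starts, function(s) {
    idx  <- s:min(s + block_size - 1, ncol(xn))
    slab <- crossprod(xn, xn[, idx, drop = FALSE])  # ncol(x) x length(idx)
    FUN(slab)  # reduce or save the slab here instead of keeping it whole
  })
}

x <- as.matrix(mtcars)
slabs <- cos_blocks(x, block_size = 4)  # tiny blocks, just for the demo
full  <- do.call(cbind, slabs)          # only sensible for small problems
dim(full)
```

With the default `FUN = identity` the slabs are simply returned, which saves nothing. The memory benefit only appears when `FUN` shrinks each slab, for example by keeping only the largest similarities per column.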