Alexander Ershov Alexander Ershov - 27 days ago 18
Scala Question

Cosine similarity on TFIDF using apache spark

I'm trying to compute cosine similarity matrix on TFIDF, using Apache Spark.
Here is my code:

def cosSim(input: RDD[Seq[String]]) = {
val hashingTF = new HashingTF()
val tf = hashingTF.transform(input)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
val mat = new RowMatrix(tfidf)
val sim = mat.columnSimilarities
sim
}


I have around 3000 lines in the input, but if I do sim.numRows() or sim.numCols() I will see 1048576 instead of 3K, as I understand, it's because val tfidf and therefore val mat both has size 3K * 1048576 where 1048576 it's number of tf features. Maybe to solve the problem I have to transpose mat, but I don't know how to do it.

Answer

You can try:

import org.apache.spark.mllib.linalg.distributed._

val irm = new IndexedRowMatrix(rowMatrix.rows.zipWithIndex.map {
   case (v, i) => IndexedRow(i, v)
})

irm.toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities