Alexander Ershov - 2 months ago
Scala Question

Cosine similarity on TF-IDF using Apache Spark

I'm trying to compute a cosine similarity matrix on TF-IDF vectors using Apache Spark.
Here is my code:

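import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

def cosSim(input: RDD[Seq[String]]) = {
  val hashingTF = new HashingTF()                // hashes terms into a fixed-size feature space
  val tf = hashingTF.transform(input)            // one term-frequency vector per input line
  val idf = new IDF().fit(tf)
  val tfidf = idf.transform(tf)
  val mat = new RowMatrix(tfidf)                 // rows = documents, columns = hashed features
  val sim = mat.columnSimilarities()             // similarities between columns, not rows
  sim
}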

I have around 3000 lines in the input, but sim.numRows() and sim.numCols() both return 1048576 instead of 3000. As I understand it, this is because tfidf, and therefore mat, has size 3000 x 1048576, where 1048576 is the number of TF features. columnSimilarities compares columns, so the result is 1048576 x 1048576. To solve the problem I probably have to transpose mat, but I don't know how to do that.
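
The shape mismatch described above can be reproduced without Spark. The following is a minimal sketch, assuming plain Scala collections as a stand-in for the distributed matrix; `SimilarityShapes`, `cosine`, and `columnCosines` are hypothetical names introduced for illustration and are not part of the Spark API. It only shows why comparing columns of a documents x features matrix gives a features x features result, and why transposing first gives documents x documents. Unlike this sketch, Spark's columnSimilarities returns a sparse upper-triangular matrix rather than a full one.

```scala
// Illustration only: plain Scala collections stand in for Spark's RowMatrix.
// All names in this object are hypothetical helpers, not Spark API.
object SimilarityShapes {
  type Matrix = Vector[Vector[Double]]

  // Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Pairwise cosine similarities between the COLUMNS of m (nCols x nCols).
  def columnCosines(m: Matrix): Matrix = {
    val cols = m.transpose
    cols.map(c1 => cols.map(c2 => cosine(c1, c2)))
  }

  // 2 documents x 3 features, mimicking a tiny TF-IDF matrix.
  val docs: Matrix = Vector(
    Vector(1.0, 0.0, 2.0),
    Vector(0.0, 3.0, 4.0)
  )

  // Comparing columns of docs compares features: 3 x 3.
  val byFeature: Matrix = columnCosines(docs)

  // Transposing first makes documents the columns: 2 x 2.
  val byDocument: Matrix = columnCosines(docs.transpose)
}
```

With 3000 documents and 1048576 hashed features, the same reasoning gives 1048576 x 1048576 for the untransposed matrix and 3000 x 3000 after transposing.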


You can try:

import org.apache.spark.mllib.linalg.distributed._

val irm = new IndexedRowMatrix(mat.rows.zipWithIndex.map {
   case (v, i) => IndexedRow(i, v)