Alexander Ershov - 4 months ago 86

Scala Question

I'm trying to compute a cosine similarity matrix on TF-IDF vectors using Apache Spark.

Here is my code:

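```
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

def cosSim(input: RDD[Seq[String]]) = {
  val hashingTF = new HashingTF()
  val tf = hashingTF.transform(input)
  tf.cache()
  val idf = new IDF().fit(tf)
  val tfidf = idf.transform(tf)
  val mat = new RowMatrix(tfidf)
  val sim = mat.columnSimilarities
  sim
}
```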

I have around 3000 lines in the input, but `sim.numRows()` and `sim.numCols()` both return 1048576 instead of 3000. As I understand it, this is because `tfidf`, and therefore `mat`, has size 3000 × 1048576, where 1048576 is the number of TF features. Maybe I need to transpose `mat` to solve the problem, but I don't know how to do that.

Answer

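`columnSimilarities` compares columns, so transpose the matrix first so that each document becomes a column. You can try: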

```
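import org.apache.spark.mllib.linalg.distributed._

// rowMatrix is the RowMatrix built from tfidf (mat in the question).
// Give each row (document) an explicit index; it becomes the column index after the transpose.
val irm = new IndexedRowMatrix(rowMatrix.rows.zipWithIndex.map {
  case (v, i) => IndexedRow(i, v)
})
// Columns are now documents, so the result is documents x documents.
irm.toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities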
```
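
To see why the transpose changes the size of the result, here is a minimal plain-Scala sketch with no Spark dependency. `ColumnSimilarityDemo`, `cosine`, and the local `columnSimilarities` helper are hypothetical names made up for illustration. Unlike Spark's method, which as far as I know returns only the upper-triangular entries as a `CoordinateMatrix`, this helper builds the full dense matrix, so treat it as a picture of the shapes rather than of Spark's implementation.

```scala
object ColumnSimilarityDemo {
  type Matrix = Vector[Vector[Double]]

  // Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Hypothetical local stand-in: similarity between every pair of columns,
  // so the result is (number of columns) x (number of columns).
  def columnSimilarities(m: Matrix): Matrix = {
    val cols = m.transpose
    cols.map(a => cols.map(b => cosine(a, b)))
  }

  def main(args: Array[String]): Unit = {
    // 2 documents (rows) x 4 features (columns), like a tiny tfidf matrix.
    val docs: Matrix = Vector(
      Vector(1.0, 0.0, 2.0, 0.0),
      Vector(0.0, 3.0, 1.0, 0.0)
    )
    val byFeature = columnSimilarities(docs)             // features x features
    val byDocument = columnSimilarities(docs.transpose)  // documents x documents
    println(s"${byFeature.size} x ${byFeature.head.size}")   // prints "4 x 4"
    println(s"${byDocument.size} x ${byDocument.head.size}") // prints "2 x 2"
  }
}
```

The same thing happens in the question at a larger scale: without the transpose the result is 1048576 × 1048576 (features), and with it the result is about 3000 × 3000 (documents).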