Shelly Shelly - 16 days ago 11
Scala Question

How to use the output of RowMatrix.columnSimilarities

I need to compute similarities between the columns of a matrix, and I tried the columnSimilarities() method to get the results.

public static void main(String[] args) {

    SparkConf sparkConf = new SparkConf().setAppName("CollarberativeFilter").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    SparkSession spark = SparkSession.builder().appName("CollarberativeFilter").getOrCreate();
    double[][] array = {{5,0,5}, {0,10,0}, {5,0,5}};
    LinkedList<Vector> rowsList = new LinkedList<Vector>();
    for (int i = 0; i < array.length; i++) {
        Vector currentRow = Vectors.dense(array[i]);
        rowsList.add(currentRow);
    }
    JavaRDD<Vector> rows = sc.parallelize(rowsList);

    // Create a RowMatrix from JavaRDD<Vector>.
    RowMatrix mat = new RowMatrix(rows.rdd());
    CoordinateMatrix simsPerfect = mat.columnSimilarities();
    RowMatrix mat2 = simsPerfect.toRowMatrix();
    List<Vector> vs2 = mat2.rows().toJavaRDD().collect();
    List<Vector> vs = mat.rows().toJavaRDD().collect();
    System.out.println("mat");
    for (Vector v : vs) {
        System.out.println(v);
    }
    System.out.println("mat2");
    for (Vector v : vs2) {
        System.out.println(v);
    }
    JavaRDD<MatrixEntry> entries = simsPerfect.entries().toJavaRDD();
    JavaRDD<String> output = entries.map(new Function<MatrixEntry, String>() {
        public String call(MatrixEntry e) {
            return String.format("%d,%d,%s", e.i(), e.j(), e.value());
        }
    });
    output.saveAsTextFile("resources123/data.txt");

}


But the output in the text file was 0,2,0.9999999999999998.

Next I tried the same example using
double[][] array = {{1,3}, {2,7}};

Then the output in the text file was 0,1,0.9982743731749959.


Can someone explain the output format to me? Can't I get a score for each and every column pair of the matrix? For example, for a 3 by 3 matrix I need 3 similarity scores: one for columns 1 and 2, one for columns 2 and 3, and one for columns 3 and 1.
Any help appreciated.

Answer

Column Similarity is computed with the Cosine Similarity defined as follows:

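    similarity(a, b) = (a · b) / (||a|| * ||b||)

where a and b are two columns of the matrix, a · b is their dot product, and ||a|| is the Euclidean norm of a.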
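As a rough check on the numbers in the question, the cosine similarity of two columns (their dot product divided by the product of their Euclidean norms) can be computed by hand in plain Java, without Spark. The class and method names below are made up for illustration, and returning 0.0 for an all-zero column is an assumption of this sketch, not something taken from Spark.

```java
public class ColumnCosine {

    // Cosine similarity between columns a and b of a dense, row-major matrix.
    static double cosine(double[][] m, int a, int b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (double[] row : m) {
            dot += row[a] * row[b];
            normA += row[a] * row[a];
            normB += row[b] * row[b];
        }
        // Assumption of this sketch: an all-zero column gets similarity 0.0.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[][] first = {{5, 0, 5}, {0, 10, 0}, {5, 0, 5}};
        double[][] second = {{1, 3}, {2, 7}};

        // Print one score per column pair (i < j) of the 3 x 3 matrix.
        for (int i = 0; i < 3; i++) {
            for (int j = i + 1; j < 3; j++) {
                System.out.println(i + "," + j + "," + cosine(first, i, j));
            }
        }
        // The 2 x 2 matrix has a single column pair.
        System.out.println("0,1," + cosine(second, 0, 1));
    }
}
```

For the two matrices in the question this gives a value within rounding of 1 for columns 0 and 2 of the first matrix, 0 for both pairs that involve column 1, and roughly 0.99827 for the 2 by 2 case, in line with the text-file output.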

Since you included the scala tag I am going to cheat and repeat what you did in the Scala REPL:

scala> import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.linalg.{Vectors, Vector}

scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

scala> val matVec = Vector(Vectors.dense(5,0,5), Vectors.dense(0,10,0), Vectors.dense(5,0,5))
matVec: scala.collection.immutable.Vector[org.apache.spark.mllib.linalg.Vector] = Vector([5.0,0.0,5.0], [0.0,10.0,0.0], [5.0,0.0,5.0])

scala> val matRDD = sc.parallelize(matVec)
matRDD: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[44] at parallelize at <console>:37

scala> val myRowMat = new RowMatrix(matRDD)
myRowMat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@7a7a07c2

scala> myRowMat.columnSimilarities.entries.collect.foreach{println}
MatrixEntry(0,2,0.9999999999999998)

This output means that there was only one nonzero entry at (row0, col2). Thus the actual (upper triangular) output was:

0    0    .9999
0    0    0
0    0    0

This is what you would expect, since the dot product between col0 and col1 is zero and the dot product between col1 and col2 is zero.
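To get one score for every column pair, including the pairs that do not appear in the output, the reported (i, j, value) triples can be expanded into a dense matrix where missing pairs default to 0. The following is a minimal plain-Java sketch of that idea. The class name, the method name, and the use of parallel int arrays instead of Spark's MatrixEntry objects (whose indices are long) are all assumptions made for illustration. The diagonal is left at 0 to match the layout shown above.

```java
public class ExpandSimilarities {

    // Build a dense n x n matrix from upper-triangular (i, j, value) triples.
    // rows[k], cols[k] and values[k] together describe one reported entry.
    static double[][] toDense(int n, int[] rows, int[] cols, double[] values) {
        double[][] dense = new double[n][n]; // pairs that were not reported stay 0.0
        for (int k = 0; k < values.length; k++) {
            dense[rows[k]][cols[k]] = values[k];
            dense[cols[k]][rows[k]] = values[k]; // mirror: similarity is symmetric
        }
        return dense;
    }

    public static void main(String[] args) {
        // The single entry from the 3 x 3 example: MatrixEntry(0,2,0.9999999999999998)
        double[][] sims = toDense(3, new int[]{0}, new int[]{2},
                                  new double[]{0.9999999999999998});

        // One line per column pair, so a 3 x 3 input yields 3 scores.
        for (int i = 0; i < 3; i++) {
            for (int j = i + 1; j < 3; j++) {
                System.out.println("columns " + i + "," + j + " -> " + sims[i][j]);
            }
        }
    }
}
```

With a real CoordinateMatrix the triples would come from collecting its entries; that step is omitted here so the sketch runs without Spark.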

Here is an example with a less sparse column similarities matrix:

scala> import scala.util.Random
import scala.util.Random

scala> def randVec(len: Int) : org.apache.spark.mllib.linalg.Vector =
     | Vectors.dense(Array.fill(len)(Random.nextDouble))
randVec: (len: Int)org.apache.spark.mllib.linalg.Vector

scala> val randRDD = sc.parallelize(Seq.fill(3)(randVec(4)))
randRDD: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[123] at parallelize at <console>:38

scala> val randRowMat = new RowMatrix(randRDD)
randRowMat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@77d9112e

scala> randRowMat.rows.collect.foreach{println}
[0.11049508671100228,0.6560383649078886,0.08647831963379027,0.918734774579884]
[0.5709766390994561,0.5404121150599919,0.8206115742925799,0.12848224469499103]
[0.5414651842028494,0.26273347471310016,0.3139446375461201,0.351113866208812]

scala> randRowMat.columnSimilarities.entries.collect.foreach{println}
MatrixEntry(0,3,0.4630854334046888)
MatrixEntry(0,2,0.9238294198864545)
MatrixEntry(2,3,0.33700154742702093)
MatrixEntry(0,1,0.7402725425024911)
MatrixEntry(1,2,0.7418690274112878)
MatrixEntry(1,3,0.8662504236158493)

These entries represent the following (upper triangular) matrix:

0       0.74027     0.92382     0.46308
0       0           0.74186     0.86625
0       0           0           0.33700
0       0           0           0