Quantad - 1 year ago 114
Scala Question

# Create a SparseVector from the elements of RDD

Using Spark in Scala, I have an RDD of type

`RDD[((x: Int, y: Int), cov: Double)]`

where each element of the RDD represents an entry of a matrix, with
`x`
representing the row,
`y`
representing the column, and
`cov`
representing the value of the entry.

I need to create SparseVectors from the rows of this matrix. So I decided to first convert the RDD to
`RDD[(x: Int, (y: Int, cov: Double))]`
and then use `groupByKey` to put all elements of a specific row together, like this:

`val rdd2 = rdd.map{case ((x,y),cov) => (x, (y, cov))}.groupByKey()`

Now I need to create the SparseVectors:

```scala
val N = 7     // vector size
val spvec = { (x: Int, y: Iterable[(Int, Double)]) =>
  new SparseVector(N.toLong, Array(y.map(el => el._1.toInt)), Array(y.map(el => el._2.toDouble)))
}
val vecs = rdd2.map(spvec)
```

However, this is the error that pops up:

```
type mismatch; found: Iterable[Int]    required: Int
type mismatch; found: Iterable[Double] required: Double
```

I am guessing that
`y.map(el => el._1.toInt)`
is returning an Iterable, which `Array(...)` cannot be applied to. I would appreciate it if someone could help with how to do this.
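That guess is essentially right: `Array(...)` is a varargs factory, so `Array(y.map(...))` builds a one-element `Array[Iterable[Int]]` instead of converting the Iterable. A minimal, Spark-free sketch of the difference (plain Scala collections, with made-up sample values):

```scala
// Plain Scala, no Spark needed: Array(...) vs .toArray
val ys: Iterable[(Int, Double)] = Seq((1, 0.5), (3, 1.2))

// Array(...) is a varargs factory: this builds a ONE-element Array[Iterable[Int]]
val wrapped = Array(ys.map(el => el._1))

// .toArray converts the Iterable itself into Array[Int] / Array[Double]
val indices: Array[Int] = ys.map(el => el._1).toArray
val values: Array[Double] = ys.map(el => el._2).toArray
```

With that change, the original construction would type-check as `new SparseVector(N, y.map(_._1).toArray, y.map(_._2).toArray)` (note that `N` should stay an `Int`, and MLlib expects the index array in increasing order, so the grouped pairs may need sorting first).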

The simplest solution is to convert to `RowMatrix`:

```scala
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val rdd: RDD[((Int, Int), Double)] = ???

val vs: RDD[SparseVector] = new CoordinateMatrix(
  rdd.map { case ((x, y), cov) => MatrixEntry(x, y, cov) }
).toRowMatrix.rows.map(_.toSparse)
```

If you want to preserve row indices you can use `toIndexedRowMatrix` instead:

```scala
import org.apache.spark.mllib.linalg.distributed.IndexedRow

new CoordinateMatrix(
  rdd.map { case ((x, y), cov) => MatrixEntry(x, y, cov) }
).toIndexedRowMatrix.rows.map { case IndexedRow(i, vs) => (i, vs.toSparse) }
```
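For intuition, the coordinate-to-sparse-row grouping that `CoordinateMatrix` performs can be mimicked locally with plain Scala collections (a hypothetical sketch with made-up sample entries, no Spark required):

```scala
// Sample coordinate entries: ((row, col), value)
val entries = Seq(((0, 1), 2.0), ((0, 3), 4.0), ((2, 0), 1.0))

// Group by row index and keep (col, value) pairs sorted by column,
// which is the per-row shape a sparse vector needs
val rows: Map[Int, Array[(Int, Double)]] =
  entries
    .groupBy { case ((x, _), _) => x }
    .map { case (x, es) =>
      x -> es.map { case ((_, y), cov) => (y, cov) }.sortBy(_._1).toArray
    }
```

Each value in `rows` corresponds to the `(indices, values)` content of one sparse row; in the distributed version, rows with no entries simply never appear, which is exactly the sparsity you want.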