Armand Grillet - 3 months ago

Flattening the key of a RDD

I have a Spark RDD of type (Array[breeze.linalg.DenseVector[Double]], breeze.linalg.DenseVector[Double]). I wish to flatten its key to transform it into an RDD of type (breeze.linalg.DenseVector[Double], breeze.linalg.DenseVector[Double]). I am currently doing:

val newRDD = oldRDD.flatMap(ob => anonymousOrdering(ob))


The signature of anonymousOrdering() is String => (Array[DenseVector[Double]], DenseVector[Double]).

This fails with the compiler error type mismatch: required: TraversableOnce[?]. The Python code doing the same thing is:

newRDD = oldRDD.flatMap(lambda point: [(tile, point) for tile in anonymousOrdering(point)])


How can I do the same thing in Scala? I generally use flatMapValues, but here I need to flatten the key.

Answer

If I understand your question correctly, the first step should use map rather than flatMap, since the tuple returned by anonymousOrdering is not a TraversableOnce (which is exactly what the type mismatch complains about):

val newRDD = oldRDD.map(ob => anonymousOrdering(ob))
// newRDD is RDD[(Array[DenseVector[Double]], DenseVector[Double])]

From there, you can "flatten" the Array portion of the tuple using pattern matching and a for/yield expression:

val flattenedRDD = newRDD.flatMap { case (a: Array[DenseVector[Double]], b: DenseVector[Double]) => for (v <- a) yield (v, b) }
// flattenedRDD is RDD[(DenseVector[Double], DenseVector[Double])]
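
If you prefer something closer to the one-step Python version, the two calls can be combined into a single flatMap. This is only a sketch, assuming anonymousOrdering accepts the RDD's element type directly:

val flattenedRDD = oldRDD.flatMap { ob =>
  val (keys, value) = anonymousOrdering(ob)
  keys.map(k => (k, value)) // one (key, value) pair per entry in the Array
}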

It's still not clear to me, though, where/how you want to use groupByKey here.
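
For completeness, here is a minimal, self-contained sketch of the whole pipeline that runs in local mode. The anonymousOrdering stub (which here takes the point itself rather than a String) and the sample data are assumptions purely for illustration:

import breeze.linalg.DenseVector
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("flatten-key").setMaster("local[*]"))

// Hypothetical stand-in for anonymousOrdering: produces several candidate keys per point.
def anonymousOrdering(point: DenseVector[Double]): (Array[DenseVector[Double]], DenseVector[Double]) =
  (Array(point.map(_ + 1.0), point.map(_ - 1.0)), point)

val oldRDD = sc.parallelize(Seq(DenseVector(1.0, 2.0), DenseVector(3.0, 4.0)))

val flattenedRDD = oldRDD
  .map(anonymousOrdering)                                      // RDD[(Array[DenseVector[Double]], DenseVector[Double])]
  .flatMap { case (keys, value) => keys.map(k => (k, value)) } // RDD[(DenseVector[Double], DenseVector[Double])]

flattenedRDD.collect().foreach(println)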