Armand Grillet Armand Grillet - 2 months ago 5x
Scala Question

Grouping a RDD using an array

I have a RDD with these elements:

("a", Array(1, 2)) | ("b", Array(3, 4)) | ("c", Array(1, 2))

I wish to group it using the array in order to have this:

(Array("a", "c"), Array(1, 2)) | (Array("b"), Array(3, 4))

How to do that (preferably in Scala)?


Since you can't use arrays as keys using Spark's default partitioner, you'll have to group by the arrays converted to lists, then just map the results back to the structure you're after:

val input: RDD[(String, Array[Int])] = ???

val result: RDD[(Array[String], Array[Int])] = input
  .groupBy(_._2.toList) // group by array
  .values // keep values only, of type Iterable[(String, Array[Int])]
  .map(it => (, it.head._2)) // map back to requested format