Armand Grillet - 1 year ago
Scala Question

Grouping an RDD using an array

I have an RDD with these elements:

("a", Array(1, 2)) | ("b", Array(3, 4)) | ("c", Array(1, 2))

I want to group it by the array so that I get this:

(Array("a", "c"), Array(1, 2)) | (Array("b"), Array(3, 4))

How can I do that (preferably in Scala)?

Answer

Since you can't use arrays as keys with Spark's default partitioner (JVM arrays are compared and hashed by reference, not by content), you'll have to group by the arrays converted to lists, then map the results back to the structure you're after:

import org.apache.spark.rdd.RDD

val input: RDD[(String, Array[Int])] = ???

val result: RDD[(Array[String], Array[Int])] = input
  .groupBy(_._2.toList) // group by the array contents (List has structural equality)
  .values // keep values only, of type Iterable[(String, Array[Int])]
  .map(it => (it.map(_._1).toArray, it.head._2)) // collect the keys, reuse the shared array
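
To try the idea without a Spark cluster, here is a minimal sketch that runs the same steps on a plain Scala Seq instead of an RDD. The object name GroupByArrayDemo and the helper groupByArray are made up for illustration, and the local collection is only a stand-in: it shows the equality behaviour and the group/map steps, not Spark's partitioning.

```scala
// Plain-collections stand-in for the RDD pipeline (assumption: no Spark on the classpath).
object GroupByArrayDemo {

  // Hypothetical helper: groups the string keys whose arrays have the same contents.
  def groupByArray(input: Seq[(String, Array[Int])]): Seq[(Array[String], Array[Int])] =
    input
      .groupBy(_._2.toList)                          // List compares by content, Array does not
      .values                                        // drop the List key, keep the grouped pairs
      .map(it => (it.map(_._1).toArray, it.head._2)) // collect the keys, reuse the shared array
      .toSeq

  def main(args: Array[String]): Unit = {
    val input = Seq(("a", Array(1, 2)), ("b", Array(3, 4)), ("c", Array(1, 2)))

    // Arrays compare by reference, so equal contents still make different keys.
    println(Array(1, 2) == Array(1, 2))               // false
    println(Array(1, 2).toList == Array(1, 2).toList) // true

    groupByArray(input)
      .sortBy(_._1.mkString) // groupBy gives no ordering guarantee; sort for stable output
      .foreach { case (keys, arr) =>
        println(s"${keys.mkString("(", ", ", ")")} -> ${arr.mkString("[", ", ", "]")}")
      }
  }
}
```

Running main should print the two groups (a, c) -> [1, 2] and (b) -> [3, 4] after the two equality checks. On a real RDD the steps are the same, except that the grouped result is also unordered across partitions.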