Prakash Khandelwal Prakash Khandelwal - 3 months ago 23
Scala Question

Apache Spark Scala : How to maintain order of values while grouping rdd by key

May be i am asking very basic question apology for that, but i didn't find it's answer on internet. I have paired RDD want to use something like aggragateByKey and concatenating all the values by a key. Value which occur first in input RDD should come first in the aggragated RDD.

Input RDD [Int, Int]
2 20
1 10
2 8
2 25

Output RDD (Aggregated RDD)
2 20 8 25
1 10


I tried aggregateByKey and gropByKey, both are giving me ouput, but order of values is not maintained. So please suggest something in this.

Answer

Since groupByKey and aggregateByKey indeed cannot preserve order - you'll have to artificially add a "hint" to each record so that you can order by that hint yourself after the grouping:

val input = sc.parallelize(Seq((2, 20), (1, 10), (2, 8), (2, 25)))

val withIndex: RDD[(Int, (Long, Int))] = input
  .zipWithIndex()  // adds index to each record, will be used to order result
  .map { case ((k, v), i) => (k, (i, v)) } // restructure into (key, (index, value))

val result: RDD[(Int, List[Int])] = withIndex
  .groupByKey()
  .map { case (k, it) => (k, it.toList.sortBy(_._1).map(_._2)) } // order values and remove index