Prakash Khandelwal Prakash Khandelwal - 9 months ago 63
Scala Question

Apache Spark Scala : How to maintain order of values while grouping rdd by key

May be i am asking very basic question apology for that, but i didn't find it's answer on internet. I have paired RDD want to use something like aggragateByKey and concatenating all the values by a key. Value which occur first in input RDD should come first in the aggragated RDD.

Input RDD [Int, Int]
2 20
1 10
2 8
2 25

Output RDD (Aggregated RDD)
2 20 8 25
1 10

I tried aggregateByKey and gropByKey, both are giving me ouput, but order of values is not maintained. So please suggest something in this.


Since groupByKey and aggregateByKey indeed cannot preserve order - you'll have to artificially add a "hint" to each record so that you can order by that hint yourself after the grouping:

val input = sc.parallelize(Seq((2, 20), (1, 10), (2, 8), (2, 25)))

val withIndex: RDD[(Int, (Long, Int))] = input
  .zipWithIndex()  // adds index to each record, will be used to order result
  .map { case ((k, v), i) => (k, (i, v)) } // restructure into (key, (index, value))

val result: RDD[(Int, List[Int])] = withIndex
  .map { case (k, it) => (k, it.toList.sortBy(_._1).map(_._2)) } // order values and remove index