Gridou Gridou - 18 days ago
Scala Question

spark reduceByKey performance/complexity when reducing lists with scala

I need to perform a reduceByKey on lists. What would be the fastest solution? I'm using the ::: operator to merge two lists in the reduce operation, but ::: is O(n) in the length of its left operand, so I'm afraid the overall reduce will end up being O(n²).

Code example:

val rdd: RDD[(Int, List[Int])] = getMyRDD()
rdd.reduceByKey(_ ::: _)
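To see why the questioner's worry is justified, here is a minimal pure-Scala sketch (no Spark needed; the sample lists are made up). Each `a ::: b` copies all of `a`, so reducing left-to-right re-copies the growing accumulator at every step, which is quadratic overall; a mutable buffer with amortized constant-time append is linear:

```scala
import scala.collection.mutable.ListBuffer

object ConcatDemo extends App {
  val lists = List(List(1, 2), List(3), List(4, 5))

  // Left-to-right reduce with ::: -- each step copies the whole
  // accumulator built so far, giving O(n^2) total copying.
  val viaConcat = lists.reduce(_ ::: _)

  // Mutable buffer with amortized O(1) append -- O(n) total.
  val buf = ListBuffer.empty[Int]
  lists.foreach(buf ++= _)
  val viaBuffer = buf.toList

  println(viaConcat)
  println(viaBuffer)
}
```

Both produce the same concatenated list; only the amount of copying differs.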


What would be the best/most efficient solution?

Answer

The best you can do is:

rdd.groupByKey.mapValues(_.flatten.toList)

This will:

  • Skip the pointless map-side reduce. Since concatenating lists doesn't shrink the data, map-side combining only adds work; skipping it requires a marginally larger shuffle but significantly reduces GC time.
  • Use a mutable buffer with amortized constant-time append for the intermediate aggregations.
  • Flatten the intermediate aggregate in O(N) time.
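The semantics of the suggested pipeline can be modelled on plain Scala collections (no cluster needed; the key/value pairs below are made up): `groupByKey` corresponds to grouping by key, and `mapValues(_.flatten.toList)` flattens each group's lists in a single O(N) pass.

```scala
object GroupFlattenDemo extends App {
  // Stand-in for RDD[(Int, List[Int])] -- hypothetical sample data.
  val pairs: Seq[(Int, List[Int])] =
    Seq(1 -> List(1, 2), 2 -> List(3), 1 -> List(4, 5))

  // groupBy(_._1) models groupByKey; flatMap(_._2) models
  // mapValues(_.flatten.toList): one linear flattening pass per key.
  val result: Map[Int, List[Int]] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.flatMap(_._2).toList }

  println(result)
}
```

Note that, unlike reduceByKey, no intermediate concatenated lists are ever built: each final list is assembled exactly once.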