Gridou Gridou - 3 months ago 58
Scala Question

spark reduceByKey performance/complexity when reducing lists with scala

I need to perform a reduceByKey on lists. What woud be the fastest solution ? I'm using the

operator to merge 2 list in the reduce operation, but
is O(n) so i'm afraid the reduce operation will end up beeing O(n2).

Code example :

val rdd: RDD[int, List[int]] = getMyRDD()
rdd.reduceByKey(_ ::: _)

What would be the best/most efficient solution ?


The best you can do is:


This will:

  • Skip obsolete map-side reduce. It requires marginally larger shuffle but significantly reduces GC time.
  • Use mutable buffer with amortized constant append time for intermediate aggregations.
  • Flatten intermediate aggregate in O(N) time.