foboi1122 foboi1122 - 4 months ago 39
Scala Question

Spark Aggregatebykey partitioner order

If I apply a hash partitioner to Spark's aggregatebykey function, i.e.

myRDD.aggregateByKey(0, new HashPartitioner(20))(combOp, mergeOp)


Does myRDD get repartitioned first before it's key/value pairs are aggregated using combOp and mergeOp? Or does myRDD go through combOp and mergeOp first and the resulting RDD is repartitioned using the HashPartitioner?

Answer

aggregateByKey applies map side aggregation before eventual shuffle. Since every partition is processed sequentially the only operation that is applied in this phase is initialization (creating zeroValue) and combOp. A goal of mergeOp is to combine aggregation buffers so it is not used before shuffle.

If input RDD is a ShuffledRDD with the same partitioner as requested for aggregateByKey then data is not shuffled at all and data is aggregated locally using mapPartitions.