If I apply a hash partitioner to Spark's aggregateByKey function, i.e.
myRDD.aggregateByKey(0, new HashPartitioner(20))(combOp, mergeOp)
aggregateByKey applies map-side aggregation before the eventual shuffle. Since every partition is processed sequentially, the only operations applied in this phase are initialization (creating the zero value) and
combOp. The goal of
mergeOp is to combine aggregation buffers, so it is not used before the shuffle.
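The two phases can be sketched in plain Python (a simulation of the mechanics, not Spark's actual implementation; the function and variable names here are my own):

```python
from collections import defaultdict

def map_side_aggregate(partition, zero, comb_op):
    """Simulate the map-side phase: each partition is scanned once,
    creating a buffer (from the zero value) per key and folding each
    value into it with comb_op. merge_op is not needed here."""
    buffers = defaultdict(lambda: zero)
    for key, value in partition:
        buffers[key] = comb_op(buffers[key], value)
    return dict(buffers)

def shuffle_and_merge(partitions, merge_op):
    """Simulate the post-shuffle phase: buffers for the same key
    coming from different partitions are combined with merge_op."""
    result = {}
    for buffers in partitions:
        for key, buf in buffers.items():
            result[key] = merge_op(result[key], buf) if key in result else buf
    return result

# Two input partitions; key "a" appears in both, so its two buffers
# must be combined by merge_op after the (simulated) shuffle.
p1 = [("a", 1), ("b", 2), ("a", 3)]
p2 = [("a", 4), ("c", 5)]
comb_op = lambda buf, v: buf + v
merge_op = lambda b1, b2: b1 + b2

map_side = [map_side_aggregate(p, 0, comb_op) for p in (p1, p2)]
final = shuffle_and_merge(map_side, merge_op)
# map_side == [{"a": 4, "b": 2}, {"a": 4, "c": 5}]
# final == {"a": 8, "b": 2, "c": 5}
```

Note that merge_op only runs where a key's buffers cross partition boundaries, which is exactly why it cannot run before the shuffle in the general case.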
If the input RDD is a
ShuffledRDD with the same partitioner as the one requested for
aggregateByKey, then the data is not shuffled at all and is aggregated locally using mapPartitions.