finman - 11 months ago
Scala Question

Spark : Average of values instead of sum in reduceByKey using Scala

When reduceByKey is called, it sums all values with the same key. Is there any way to calculate the average of the values for each key?

// I calculate the sum like this and don't know how to calculate the avg.
// The pair RDD contains ((String, Int), Double) entries such as:

Array(((Type1,1),4.0), ((Type1,1),9.2), ((Type1,2),8.0), ((Type1,2),4.5), ((Type1,3),3.5),
((Type1,3),5.0), ((Type2,1),4.6), ((Type2,1),4.0), ((Type2,1),10.0), ((Type2,1),4.3))

Answer Source

One way is to use mapValues and reduceByKey, which is easier than aggregateByKey: pair each value with a count of 1, sum both the values and the counts per key, then divide.

rdd
  .mapValues(x => (x, 1))                              // (value, count = 1)
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))   // (sum, count) per key
  .mapValues { case (sum, count) => sum / count }      // average = sum / count
  .collect
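Outside of Spark, the same sum-and-count trick can be sketched with plain Scala collections, where `groupBy` plus a `reduce` stands in for `reduceByKey`. This is only an illustration of the technique, not Spark itself; `averageByKey` is a hypothetical helper name, and the data mirrors the example in the question:

```scala
// Sketch of per-key averaging via (sum, count) pairs, using plain collections.
def averageByKey(data: Seq[((String, Int), Double)]): Map[(String, Int), Double] =
  data
    .map { case (k, v) => (k, (v, 1)) }          // pair each value with a count of 1
    .groupBy(_._1)                               // group by key, like reduceByKey's shuffle
    .map { case (k, pairs) =>
      // combine (value, count) pairs into a single (sum, count) per key
      val (sum, count) = pairs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
      k -> sum / count                           // average = sum / count
    }

val data = Seq(
  (("Type1", 1), 4.0), (("Type1", 1), 9.2),
  (("Type1", 2), 8.0), (("Type1", 2), 4.5),
  (("Type1", 3), 3.5), (("Type1", 3), 5.0),
  (("Type2", 1), 4.6), (("Type2", 1), 4.0),
  (("Type2", 1), 10.0), (("Type2", 1), 4.3)
)

println(averageByKey(data))
```

For example, the key `("Type2", 1)` has values 4.6, 4.0, 10.0 and 4.3, which sum to 22.9 over a count of 4, giving an average of 5.725.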