Darshan Darshan - 3 months ago 5
Scala Question

Count operation in reduceByKey in spark

val temp1 = tempTransform.map({ temp => ((temp.getShort(0), temp.getString(1)), (USAGE_TEMP.getDouble(2), USAGE_TEMP.getDouble(3)))})
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))


Here I have performed Sum operation But Is it possible to do count operation inside reduceByKey.

Like what i think,

reduceByKey((x, y) => (math.count(x._1),(x._2+y._2)))


But this is not working any suggestion please.

Answer

Well, counting is equivalent to summing 1s, so just map the first item in each value tuple into 1 and sum both parts of the tuple like you did before:

val temp1 = tempTransform.map { temp => 
   ((temp.getShort(0), temp.getString(1)), (1, USAGE_TEMP.getDouble(3)))
}
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))

Result would be an RDD[((Short, String), (Int, Double))] where the first item in the value tuple (the Int) is the number of original records matching that key.

That's actually the classic map-reduce example - word count.

Comments