nzn nzn - 18 days ago
Scala Question

ReduceByKey on specified value element

New to Spark and trying to understand reduceByKey, which is designed to accept RDD[(K, V)]. What isn't clear to me is how to apply this function when the value is a list/tuple...

After various mapping and filtering operations my RDD has ended up in the form (Cluster:String, (Unique_ID:String, Count:Int)), in which I can have many elements belonging to the same cluster, e.g.:

Array((a,(lkn,12)), (a,(hdha,2)), (a,(naa,35)), (b, (cdas,20)) ...)


Now I want to use reduceByKey to find, for each cluster, the element with the highest count (so one entry per cluster). In the above example this would be (a,(naa,35)) for cluster a.

I can figure out how to find the maximum per cluster if I have a simple (key, value) pair, using reduceByKey with math.max. But I don't understand how to extend this when the value is itself a tuple of values.
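For reference, the simple case being described can be sketched with plain Scala collections standing in for an RDD (the reduce function is the same one you would pass to Spark's reduceByKey; the sample data here is made up for illustration):

```scala
// Simple (key, value) case: max count per key.
// On a Spark RDD this would be rdd.reduceByKey(math.max(_, _));
// groupBy + reduce on a Seq mimics that grouping locally.
val counts = Seq(("a", 12), ("a", 35), ("b", 20))

val maxPerKey = counts
  .groupBy(_._1)                                        // group pairs by key
  .map { case (k, vs) => k -> vs.map(_._2).reduce(math.max) }  // keep max count

// maxPerKey == Map("a" -> 35, "b" -> 20)
```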

Am I using the wrong function here?

Answer

You can:

rdd.reduceByKey { case (x, y) => if (x._2 > y._2) x else y }

This:

  • logically partitions the data into groups defined by the key (_._1):

    • group for "a": (a, [(lkn,12), (hdha,2), (naa,35), ...])
    • group for "b": (b, [(cdas,20), ...])
  • reduces the values in each group by comparing their second elements (x._2 > y._2) and keeping the tuple with the higher count.
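The whole answer can be sketched end-to-end with plain Scala collections in place of the RDD, using the question's sample data and the exact reduce function from above (groupBy + reduce here is a local stand-in for what reduceByKey does per partition and across the cluster):

```scala
// Sample data in the question's shape: (Cluster, (Unique_ID, Count)).
val data = Seq(
  ("a", ("lkn", 12)),
  ("a", ("hdha", 2)),
  ("a", ("naa", 35)),
  ("b", ("cdas", 20))
)

// Same combining function as rdd.reduceByKey { case (x, y) => if (x._2 > y._2) x else y }:
// of two (id, count) values with the same key, keep the one with the larger count.
val best = data
  .groupBy(_._1)
  .map { case (k, vs) =>
    k -> vs.map(_._2).reduce { (x, y) => if (x._2 > y._2) x else y }
  }

// best == Map("a" -> ("naa", 35), "b" -> ("cdas", 20))
```

Note that the function passed to reduceByKey must be associative and commutative, since Spark applies it in an arbitrary order within and across partitions; this max-by-count comparison satisfies both.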