raHul raHul - 2 months ago 31
Scala Question

Spark Scala: filter tuples in a list

I have an RDD like the one below:

val m = sc.parallelize(Seq(("a",("x",1)), ("a",("y",2)), ("a",("z",2)), ("b",("x",1)),("b",("y",2))))


I transformed the RDD above using groupByKey, like this:

val b = m.groupByKey.mapValues( _.toList)


Result:

(a,List((x,1), (y,2), (z,2)))
(b,List((x,1), (y,2)))


Now I want to keep only the tuples with the maximum value in each list.
The expected result would be:

(a,List((y,2), (z,2)))
(b,List((y,2)))

Answer

Assuming the input is a plain Scala sequence rather than an RDD:

val m = Seq(("a",("x",1)), ("a",("y",2)), ("a",("z",2)), ("b",("x",1)),("b",("y",2)))

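val r1 =
  m.groupBy(_._1)
   // drop the key from each grouped element, keeping only the (name, value) tuples
   .map { case (k, v) => k -> v.map(_._2) }
   .map { case (k, v) =>
     k -> {
       // sort descending by value so the maximum comes first
       val sorted = v.sortWith { case (x, y) => x._2 > y._2 }
       val max = sorted.head._2

       // keep the leading run of tuples that share the maximum value
       sorted.takeWhile(_._2 == max)
     }
   }
   .toList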

This gives the following result (the order of the keys is not guaranteed, since groupBy returns a Map):

r1: List[(String, Seq[(String, Int)])] = List((b,List((y,2))), (a,List((y,2), (z,2))))
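
The sort in the answer is only needed to find the maximum. As an alternative sketch (not the answer's own method), the maximum could be computed directly and the list filtered against it, which also keeps the tuples in their original order. The names keepMax, grouped and result are hypothetical, introduced here for illustration, and the input is written out by hand to match the grouped result shown in the question. The snippet is meant for the Scala REPL or spark-shell.

```scala
// Hypothetical helper (not part of the original answer): keep only the
// tuples whose Int component equals the largest one in the list.
def keepMax(pairs: List[(String, Int)]): List[(String, Int)] =
  if (pairs.isEmpty) Nil
  else {
    val max = pairs.map(_._2).max   // largest value in this group
    pairs.filter(_._2 == max)       // keeps ties, preserves input order
  }

// Hand-written stand-in for the grouped data from the question.
val grouped = List(
  "a" -> List(("x", 1), ("y", 2), ("z", 2)),
  "b" -> List(("x", 1), ("y", 2))
)

val result = grouped.map { case (k, v) => k -> keepMax(v) }
// result == List((a,List((y,2), (z,2))), (b,List((y,2))))
```

On the grouped RDD from the question, the same helper could presumably be applied per key with b.mapValues(keepMax), since mapValues is available on pair RDDs. That usage has not been run here.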