Peter Tan - 1 month ago
Scala Question

Need help to group by then sort by value on an rdd at apache spark via scala

I currently have the following format of data in an RDD:

("UserId, "TypeId", Score)


The typeId ranges from 1 to 57 for each user.

So the data looks like the following:

RDD(
("user1", 1, 1.0),
("user1", 2, 2.0),
("user2", 1, 3.0),
("user2", 2, 4.0),
("user3", 1, 5.0),
("user3", 2, 6.0))


I need to flatten this out so each user has one row with its scores ordered by typeId:

("User1", 1.0, 2.0, ...),
("User2", 3.0, 4.0, ...),
("User3", 5.0, 6.0, ...)


Can anyone point me in the right direction to tackle this?

Answer

You could convert the RDD to a keyed RDD, then use groupByKey on it. Note that groupByKey does not guarantee any ordering of the grouped values, so the scores should be sorted by typeId after grouping.

import org.apache.spark.SparkContext

val sparkContext = new SparkContext("local", "Simple App")
val ss = List(("user1", 1, 1.0), ("user1", 2, 2.0), ("user2", 1, 3.0),
  ("user2", 2, 4.0), ("user3", 1, 5.0), ("user3", 2, 6.0))
val rdd = sparkContext.parallelize(ss, 2)
// Key by userId, keeping (typeId, score) so the scores can be ordered later
val keyedRDD = rdd.map { case (user, typeId, score) => (user, (typeId, score)) }
// groupByKey collects all (typeId, score) pairs per user;
// sort each group by typeId and keep only the scores
val groupedRDD = keyedRDD.groupByKey()
  .mapValues(_.toList.sortBy(_._1).map(_._2))
// (reduceByKey((x, y) => x + y) would instead sum the scores per user)
groupedRDD.foreach(println)
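The group-then-sort logic itself can be checked without a cluster using plain Scala collections. This is a sketch of the transformation only, using the same sample rows as the RDD above:

```scala
// Group rows by userId, order each user's scores by typeId,
// and keep only the score values -- same shape as the RDD version.
val rows = List(("user1", 1, 1.0), ("user1", 2, 2.0), ("user2", 1, 3.0),
  ("user2", 2, 4.0), ("user3", 1, 5.0), ("user3", 2, 6.0))
val flattened = rows
  .groupBy(_._1)
  .map { case (user, rs) => (user, rs.sortBy(_._2).map(_._3)) }
  .toList
  .sortBy(_._1)
// flattened == List(("user1", List(1.0, 2.0)),
//                   ("user2", List(3.0, 4.0)),
//                   ("user3", List(5.0, 6.0)))
```

On an actual RDD the equivalent calls are `keyedRDD.groupByKey()` followed by `mapValues` with the same sort, as shown in the answer code.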