Raul H - 1 month ago
Scala Question

Take first n records from dataframe grouped by unique id

I have my Dataset like this

As you can see, it is ordered by rating and userId. I need to get a new DataFrame with only the top 2 results for each unique user_id. I've tried:

dataframe.groupBy("user_id").agg(someUdfFunction)


I tried using the rank function, but it doesn't seem to work. I also tried filtering the DataFrame, with no result. How can I accomplish this?

Answer

Try:

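import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val n = 2  // number of rows to keep per user

// Number the rows within each user, highest rating first.
// Adjust "userId" if your column is actually named user_id.
val window = Window.partitionBy("userId").orderBy(col("rating").desc)

// Keep only the first n rows of each partition
dataframe.withColumn("r", row_number().over(window)).where(col("r") <= n)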
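
To try this end to end, here is a minimal sketch that runs against a local SparkSession. The sample rows, the `item` column, the object name `TopNPerUser` and the helper `topN` are made up for illustration; only the `userId` and `rating` column names come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object TopNPerUser {
  // Hypothetical helper: keep the n highest-rated rows for each userId.
  def topN(df: DataFrame, n: Int): DataFrame = {
    val w = Window.partitionBy("userId").orderBy(col("rating").desc)
    df.withColumn("r", row_number().over(w)) // 1, 2, 3, ... within each user
      .where(col("r") <= n)
      .drop("r")                             // drop the helper column again
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("top-n-per-user")
      .getOrCreate()
    import spark.implicits._ // needed for .toDF on a local Seq

    // Made-up sample data: (userId, item, rating)
    val ratings = Seq(
      (1, "a", 5.0), (1, "b", 4.0), (1, "c", 3.0),
      (2, "d", 4.5), (2, "e", 2.0),
      (3, "f", 1.0)
    ).toDF("userId", "item", "rating")

    // User 1 loses its lowest-rated row; users 2 and 3 keep all of theirs.
    topN(ratings, 2).orderBy(col("userId"), col("rating").desc).show()

    spark.stop()
  }
}
```

A note on the choice of `row_number` over `rank`: `rank` gives tied ratings the same value, so filtering on `rank <= n` can return more than n rows for a user, while `row_number` always returns at most n. When ratings tie, which of the tied rows `row_number` keeps is not deterministic unless you add a second ordering column.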