user1384205 user1384205 - 26 days ago 9
Scala Question

Add column index to dataframe based on another column (user in this case)

I have a dataframe as given below where the last column represents the number of times the user has searched for the location and stay

| Hanks| Rotterdam| airbnb7| 1|
|Sanders| Rotterdam| airbnb2| 1|
| Hanks| Rotterdam| airbnb2| 3|
| Hanks| Tokyo| airbnb8| 2|
| Larry| Hanoi| | 2|
| Mango| Seoul| airbnb5| 1|
| Larry| Hanoi| airbnb1| 2|


which i want to transform as follows

| Hanks| Rotterdam| airbnb7| 1| 1|
|Sanders| Rotterdam| airbnb2| 1| 1|
| Hanks| Rotterdam| airbnb2| 3| 2|
| Hanks| Tokyo| airbnb8| 2| 3|
| Larry| Hanoi| | 2| 0|
| Mango| Seoul| airbnb5| 1| 1|
| Larry| Hanoi| airbnb1| 2| 1|


Notice that column 5 represents the index of the unique combination of options(location+stay) that user selected.
eg

| Hanks| Rotterdam| airbnb7| 1| 1|
| Hanks| Rotterdam| airbnb2| 3| 2|
| Hanks| Tokyo| airbnb8| 2| 3|


I tried using groupBy/Agg to do this by implementing a udf function as the following in the agg function.

val df2 = df1.groupBy("User", "clickedDestination", "clickedAirbnb")
.agg(indexUserDetailsUDF(col("clickedAirbnb")) as ("clickedAirbnbIndex"))


And the udf as follows

var cnt = 0
val airbnbClickIndex:(String) => String = (airbnb) => {
if(airbnb== "") "null" //return 0 for airbnbClickIndex when airbnb is empty
else{cnt+=1; cnt.toString()} //otherwise return incremented value
}
val indexUserDetailsUDF = udf(airbnbClickIndex)


But this is not working. Any input is much appreciated.
Thanks.

Answer

If I got it right, you probably want a windowed rank. You could try the following:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val window = Window.partitionBy("User").orderBy("User", "clickedDestination", "clickedAirbnb")

val result = df.withColumn("clickedAirbnbIndex", dense_rank().over(window))

If needed, you can find some good reading about window functions in Spark here.

Also, the functions package api documentation is very useful.