Mnemosyne Mnemosyne - 6 months ago 38
Scala Question

Spark Dataframe: How to aggregate both numerical and nominal columns

I am using Spark Dataframes and have dataframe

similar to this:

id: String | amount: Double | donor: String
1 | 50 | Mary
2 |100 | Michael
1 | 60 | Minnie
1 | 20 | Mark
2 | 55 | Mony

I want to aggregate my dataframe in one go and get this output:

id: String | amount: Double | donor: Seq[String]
1 |130 | {Mary,Minnie,Mark}
2 |155 | {Michael, Mony}

So I want to do something like:


Aggregating the sum of the numbers is easy, but I can't find a way to aggregate the text content as a Sequence or Array (or any similar type that is Iterable). How can I do this in scala/spark?


I am looking for some spark Dataframe or RDD based function to do the collection of strings. Functions as the below mentioned
are Hive based and I need specific dependencies for that. But I am not using Hive at all in my project.



df.groupyBy("id").agg(sum("amount"), collect_list("donor"))


df.groupyBy("id").agg(sum("amount"), collect_set("donor"))