Mnemosyne - 2 months ago
Scala Question

Spark Dataframe: How to aggregate both numerical and nominal columns

I am using Spark DataFrames and have a DataFrame df similar to this:

id: String | amount: Double | donor: String
--------------------------------------------
1          | 50             | Mary
2          | 100            | Michael
1          | 60             | Minnie
1          | 20             | Mark
2          | 55             | Mony


I want to aggregate my dataframe in one go and get this output:

id: String | amount: Double | donor: Seq[String]
-------------------------------------------------
1          | 130            | {Mary, Minnie, Mark}
2          | 155            | {Michael, Mony}


So I want to do something like:

df.groupBy("id").agg(sum("amount"), _?Seq?_("donor"))


Summing the numbers is easy, but I can't find a way to aggregate the text content into a Seq or Array (or any similar Iterable type). How can I do this in Scala/Spark?

EDIT:

I am looking for a Spark DataFrame or RDD based function to collect the strings. Functions such as collect_set, mentioned below, are Hive based and need specific dependencies, but I am not using Hive at all in my project.

Answer

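Try:

import org.apache.spark.sql.functions.{collect_list, collect_set, sum}

df.groupBy("id").agg(sum("amount"), collect_list("donor"))

or, if duplicate donor names should be dropped:

df.groupBy("id").agg(sum("amount"), collect_set("donor"))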
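
For completeness, here is a minimal, self-contained sketch of the aggregation above, plus an RDD-only variant for the case raised in the EDIT. It assumes Spark 2.0 or later, where (as far as I know) collect_list and collect_set are built-in aggregates and no longer need a HiveContext. On Spark 1.x they were Hive UDAFs, so the RDD route may be the safer choice there. The object name AggregateDonors, the helper names, and the local master setting are illustrative choices of mine, not anything from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{collect_list, sum}

object AggregateDonors {

  // Sample rows copied from the question (hypothetical helper).
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("1", 50.0, "Mary"),
      ("2", 100.0, "Michael"),
      ("1", 60.0, "Minnie"),
      ("1", 20.0, "Mark"),
      ("2", 55.0, "Mony")
    ).toDF("id", "amount", "donor")
  }

  // DataFrame route: one groupBy with two aggregates.
  // Assumes Spark 2.0+ so that collect_list does not require Hive.
  def aggregate(df: DataFrame): DataFrame =
    df.groupBy("id").agg(
      sum("amount").as("amount"),
      collect_list("donor").as("donor")
    )

  // RDD route: uses no SQL aggregate functions at all.
  // Each row becomes (id, (amount, Vector(donor))) and pairs are merged per key.
  def aggregateRdd(df: DataFrame): Map[String, (Double, Vector[String])] =
    df.rdd
      .map(r => (r.getAs[String]("id"),
                 (r.getAs[Double]("amount"), Vector(r.getAs[String]("donor")))))
      .reduceByKey { case ((a1, d1), (a2, d2)) => (a1 + a2, d1 ++ d2) }
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("aggregate-donors")
      .master("local[*]")
      .getOrCreate()

    val df = sample(spark)
    aggregate(df).show(false)
    aggregateRdd(df).foreach(println)

    spark.stop()
  }
}
```

Both routes should give a total of 130 for id 1 and 155 for id 2. The order of names inside each collected list is not guaranteed, so sort it afterwards if order matters.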