Laurens van Dam - 2 months ago
Scala Question

Spark SQL: sum rows with the same key and append the sum value

Suppose I have the following DataFrame.

+----+-----+
|lang|count|
+----+-----+
| en| 4|
| en| 5|
| de| 2|
| en| 2|
| nl| 4|
| nl| 5|
+----+-----+


How do I sum the values of “count” for each unique language and append this value as a new column (thus without reducing the number of rows)?

In my example, this would result in:

+----+-----+----------------+
|lang|count|totalCountInLang|
+----+-----+----------------+
| en| 4| 11|
| en| 5| 11|
| de| 2| 2|
| en| 2| 11|
| nl| 4| 9|
| nl| 5| 9|
+----+-----+----------------+


The DataFrames are constructed through a map operation on a DStream.

Any suggestions on what would be the best way to achieve this? Is there a more efficient way than using DataFrames?

Thanks in advance!

Answer

You can use one of the following:

  • sum over a window:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum
    import spark.implicits._  // assumes an active SparkSession named spark

    val df = Seq(
      ("en", 4), ("en", 5), ("de", 2),
      ("en", 2), ("nl", 4), ("nl", 5)
    ).toDF("lang", "count")

    // Frame spanning the whole partition, so every row sees the full per-lang sum
    val w = Window.partitionBy("lang")
      .rowsBetween(Long.MinValue, Long.MaxValue)
    df.withColumn("totalCountInLang", sum("count").over(w))
    
  • aggregation and join:

    // Alias the aggregate so the joined column is named totalCountInLang
    df.join(df.groupBy("lang").agg(sum("count").alias("totalCountInLang")), Seq("lang"))
    

With a small number of distinct groups the former solution should behave slightly better. For larger ones the latter, optionally combined with the broadcast function, is usually preferred.
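For illustration, a minimal sketch of the broadcast variant (assuming the per-language totals are small enough to ship to every executor):

    import org.apache.spark.sql.functions.{broadcast, sum}

    // Aggregate once per language, then broadcast the small result to all
    // executors, turning the shuffle join into a map-side join
    val totals = df.groupBy("lang").agg(sum("count").alias("totalCountInLang"))
    df.join(broadcast(totals), Seq("lang"))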
