karoma karoma - 3 months ago 51
Scala Question

Spark - Sum of row values

I have the following DataFrame:

January | February | March
-----------------------------
10 | 10 | 10
20 | 20 | 20
50 | 50 | 50


I'm trying to add a column to this which is the sum of the values of each row.

January | February | March | TOTAL
----------------------------------
10 | 10 | 10 | 30
20 | 20 | 20 | 60
50 | 50 | 50 | 150


As far as I can see, all the built in aggregate functions seem to be for calculating values in single columns. How do I go about using values across columns on a per row basis (using Scala)?

I've gotten as far as

val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...

Answer

You were very close with this:

val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...

Instead, try this:

val newDf = df.select(colsToSum.map(col).reduce((c1, c2) => c1 + c2) as "sum")

I think this is the best of the the answers, because it is as fast as the answer with the hard-coded SQL query, and as convenient as the one that uses the UDF. It's the best of both worlds -- and I didn't even add a full line of code!