Dean - 4 days ago
Scala Question

Count instances of combination of columns in spark dataframe using scala

I have a Spark DataFrame in Scala called df with two columns, say a and b. Column a contains letters and column b contains numbers, giving the below.

a b
----------
g 0
f 0
g 0
f 1


I can get the distinct rows using

val dfDistinct = df.select("a", "b").distinct


which gives the following:

a b
----------
g 0
f 0
f 1


I want to add another column with the number of times each distinct combination occurs in the first DataFrame, so I'd end up with

a b count
----------
g 0 2
f 0 1
f 1 1


I don't mind whether this modifies the original command or is a separate operation on dfDistinct that gives another DataFrame.

Any advice is greatly appreciated, and I apologise for the trivial nature of this question, but I'm not the most experienced with this kind of operation in Scala or Spark.

Thanks

Dean

Answer

You can simply aggregate and count:

import sqlContext.implicits._  // needed for the $"col" syntax

df.groupBy($"a", $"b").count

or a little bit more verbose:

import org.apache.spark.sql.functions.{count, lit}

df.groupBy($"a", $"b").agg(count(lit(1)).alias("cnt"))

Both are equivalent to a raw SQL aggregation:

df.registerTempTable("df")

sqlContext.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b")
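For completeness, the whole flow can be sketched end to end. This is a minimal sketch assuming Spark 2.x, where SparkSession replaces sqlContext and createOrReplaceTempView replaces the now-deprecated registerTempTable; the names spark and the local[*] master are illustrative, not from the original post:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative Spark 2.x setup; adjust the app name and master as needed.
val spark = SparkSession.builder()
  .appName("count-combinations")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._  // enables .toDF and the $"col" syntax

// The sample data from the question.
val df = Seq(("g", 0), ("f", 0), ("g", 0), ("f", 1)).toDF("a", "b")

// One aggregation replaces the distinct-then-count two-step:
val dfCounts = df.groupBy($"a", $"b").count()
dfCounts.show()
// Row order is not guaranteed, but the rows are:
// (g, 0, 2), (f, 0, 1), (f, 1, 1)

// The equivalent raw SQL in Spark 2.x:
df.createOrReplaceTempView("df")
spark.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b").show()
```

Because groupBy already produces one row per distinct (a, b) pair, the separate distinct step from the question is unnecessary.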