Dean - 2 months ago 22

Scala Question

I have a spark data frame in scala called df with two columns, say a and b. Column a contains letters and column b contains numbers giving the below.

`a b`

----------

g 0

f 0

g 0

f 1

I can get the distinct rows using

`val dfDistinct=df.select("a","b").distinct`

which gives the following:

`a b`

----------

g 0

f 0

f 1

I want to add another column with the number of times these distinct combinations occurs in the first dataframe so I'd end up with

`a b count`

----------

g 0 2

f 0 1

f 1 1

I don't mind if that modifies the original command or I have a separate operation on dfDistinct giving another data frame.

Any advice greatly appreciated and I apologise for the trivial nature of this question but I'm not the most experienced with this kind of operation in scala or spark.

Thanks

Dean

Answer

You can simply aggregate and count:

```
df.groupBy($"a", $"b").count
```

or a little bit more verbose:

```
import org.apache.spark.sql.functions.{count, lit}
df.groupBy($"a", $"b").agg(count(lit(1)).alias("cnt"))
```

Both are equivalent to a raw SQL aggregation:

```
df.registerTempTable("df")
sqlContext.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b")
```