
Spark SQL: find maximum and sum grouped by key

I have a Spark DataFrame with the following schema.

counts
|-- index: string
|-- name: string
|-- count: long


I would like to sum the count column for each index, and also find the maximum count together with its corresponding name. So each result row would contain a unique index with SUM(count), MAX(count), and the name belonging to that maximum.
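For example, given hypothetical rows like these:

index  name  count
a      x     3
a      y     5
b      z     2

the desired result would be:

index  name  max_count  sum_count
a      y     5          8
b      z     2          2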

I group the entries by index and select the row with the highest count by running the following query.

SELECT * FROM counts
WHERE (index, count)
IN (
SELECT index, MAX(count)
FROM counts
GROUP BY index
)
ORDER BY count DESC


This works fine and gives me, for each index, the single row with the highest count.

However, I would also like to add a column containing the sum of count grouped by index, computed over the original table.

I can sum the counts grouped by index with this query:

SELECT index, SUM(count)
FROM counts
GROUP BY index


But I would like a single, unified table containing the results of both queries, computed efficiently. The obvious approach is to join the two results, as sketched below.
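A sketch of that join, using the two queries above (the alias names max_count and sum_count are arbitrary); it scans counts twice, which is what I would like to avoid:

SELECT m.index, m.name, m.count AS max_count, s.sum_count
FROM (
    -- the row with the highest count per index
    SELECT * FROM counts
    WHERE (index, count) IN (
        SELECT index, MAX(count)
        FROM counts
        GROUP BY index
    )
) m
JOIN (
    -- the per-index sums
    SELECT index, SUM(count) AS sum_count
    FROM counts
    GROUP BY index
) s
ON m.index = s.index
ORDER BY m.count DESC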

Any suggestions on this matter are much appreciated.

Answer

Just use window functions:

SELECT c.*
FROM (SELECT c.*, SUM(count) OVER (PARTITION BY index) as sum_count,
             ROW_NUMBER() OVER (PARTITION BY index ORDER BY count DESC) as seqnum
      FROM counts c
     ) c
WHERE seqnum = 1;
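Note that ROW_NUMBER assigns exactly one row per index; if two rows tie for the maximum count, one of them wins arbitrarily (use RANK instead if you want to keep all ties). Also, the outer SELECT c.* returns the helper columns sum_count and seqnum as well; list the desired columns explicitly if you want to drop seqnum.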

This assumes you have other columns you want to keep. If you just want the max and the sum for each index:

SELECT index, MAX(count), SUM(count)
FROM counts
GROUP BY index;
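Note, though, that this aggregate query cannot tell you which name belongs to the maximum count; to keep that column, use the window-function query above.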

And index is a bad name for a column, because it is a SQL keyword (think "create index").
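In Spark SQL, an identifier that clashes with a keyword can be escaped with backticks, for example:

SELECT `index`, SUM(count)
FROM counts
GROUP BY `index`;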
