Kzas Kzas - 4 months ago 10
SQL Question

SQL reduce dates to range "start - end"

I have repeat rows in a table with several dates:

ID STATE DATE
----------------------------
id01 connected 2015-04-04
id01 connected 2015-04-05
id01 connected 2015-04-08
id01 disconect 2015-04-11
id01 disconect 2015-04-12
id01 connected 2015-04-13


I want a query with "start date" and "end date", with this result:

ID STATE START DATE END DATE
----------------------------------------
id01 connected 2015-04-04 2015-04-10
id01 disconect 2015-04-11 2015-04-12
id01 connected 2015-04-13 XXXXXXXXXX


The last "end date" it's not important (last value, null, now()...)

The most important is detect date of change (in this example no row for 2015-04-10, and the same state happends in 2015-04-13).

Posible solution? (not valid)

SELECT ID, STATE, MIN(date), MAX(date)
FROM TABLE
GROUP BY ID, STATE;


Isn't valid because merge intervals:

ID STATE START DATE END DATE
----------------------------------------
id01 connected 2015-04-04 XXXXXXXXXX
id01 disconect 2015-04-11 2015-04-12


Query has run in Impala (similar SQL92)

Answer

Impala supports window functions. This problem is a "gap-and-islands" problem, so it can be solved using a difference of row numbers:

select id, state, min(date) as start_date, max(date) as end_date
from (select t.*,
             row_number() over (partition by id order by date) as seqnum_id,
             row_number() over (partition by id, state order by date) as seqnum_isd
      from table t
     ) t
group by id, state, (seqnum_id - seqnum_isd);

The logic for the difference is not difficult, but tricky when you first learn it. It helps to run the subquery and see what the row number values are -- and why the difference defines each group.