We have a time series in a Spark SQL table that records every time a user performs an event.
However, users tend to perform many events in a burst, and I want to find the minimum time of each of these bursts.
Unfortunately this is historical data, so I can't change how the table was created. Essentially I want something like

select min(time_), user from my_table group by user

but with one row per burst rather than one per user.
If this is the only information you need, flag each event that starts a burst (the third argument to lag supplies a default of time_ - 60 for each user's first event, so it always qualifies as a burst start):

select user, time_
from  (select user, time_,
              case when time_ - lag(time_, 1, time_ - 60)
                       over (partition by user order by time_) >= 60
                   then 'Y'
                   else null
              end as burst
       from my_table) t
where burst = 'Y';
user  time_
0     10
0     83
2     12
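The query above can be sanity-checked outside Spark. Below is a minimal Python sketch of the same lag-based burst detection using the standard library's sqlite3 (SQLite 3.25+ supports window functions); the event rows are made-up values chosen to reproduce the sample output above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table my_table (user int, time_ int)")

# hypothetical event rows chosen to reproduce the sample output:
# user 0 has bursts starting at t=10 and t=83, user 2 at t=12
rows = [(0, 10), (0, 11), (0, 12), (0, 83), (0, 84), (0, 85),
        (2, 12), (2, 13), (2, 15)]
con.executemany("insert into my_table values (?, ?)", rows)

# the third lag argument defaults each user's first event to time_ - 60,
# so the first event always passes the >= 60 test and starts a burst
query = """
select user, time_
from (select user, time_,
             case when time_ - lag(time_, 1, time_ - 60)
                      over (partition by user order by time_) >= 60
                  then 'Y' else null end as burst
      from my_table) t
where burst = 'Y'
order by user, time_
"""
result = con.execute(query).fetchall()
print(result)  # [(0, 10), (0, 83), (2, 12)]
```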
If you'll need to gather some additional information on each burst, number the bursts per user by counting the burst markers cumulatively (the + 1 accounts for the first burst, whose opening event carries a null marker) and then aggregate per burst:

select user, burst_seq,
       min(time_)  as min_time_,
       max(time_)  as max_time_,
       count(*)    as events_num
from  (select user, time_,
              count(burst) over (partition by user order by time_
                                 rows unbounded preceding) + 1 as burst_seq
       from (select user, time_,
                    case when time_ - lag(time_)
                             over (partition by user order by time_) >= 60
                         then 'Y'
                         else null
                    end as burst
             from my_table) t) t
group by user, burst_seq;
user  burst_seq  min_time_  max_time_  events_num
0     1          10         12         3
0     2          83         85         3
2     1          12         15         3
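This gaps-and-islands variant can be checked the same way. Here is a Python sqlite3 sketch of the burst_seq query, again with made-up event rows that reproduce the sample output above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table my_table (user int, time_ int)")

# hypothetical event rows matching the sample output
rows = [(0, 10), (0, 11), (0, 12), (0, 83), (0, 84), (0, 85),
        (2, 12), (2, 13), (2, 15)]
con.executemany("insert into my_table values (?, ?)", rows)

# count(burst) is a running count of non-null burst markers, so every event
# in the same burst shares one burst_seq; + 1 numbers the first burst as 1
query = """
select user, burst_seq,
       min(time_) as min_time_, max(time_) as max_time_,
       count(*) as events_num
from (select user, time_,
             count(burst) over (partition by user order by time_
                                rows unbounded preceding) + 1 as burst_seq
      from (select user, time_,
                   case when time_ - lag(time_)
                            over (partition by user order by time_) >= 60
                        then 'Y' else null end as burst
            from my_table) t) t
group by user, burst_seq
order by user, burst_seq
"""
result = con.execute(query).fetchall()
print(result)  # [(0, 1, 10, 12, 3), (0, 2, 83, 85, 3), (2, 1, 12, 15, 3)]
```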
There seems to be a bug in the handling of the CASE expression: case when ... then 'Y' end (with no explicit else branch) fails with FAILED: IndexOutOfBoundsException Index: 2, Size: 2, although it is legal syntax. Adding an explicit else null works around it.