j__ j__ - 21 days ago 6
SQL Question

Finding first sighting in SQL

We have a time series in an spark sql table which describes every time a user does an event.

However, users tend to do many events in a burst. I want to find the min time for everyone of these bursts.

Unfortunately this is historical data so I cant change how the table was created. So I essentially want a

select min(time_), user from my_table group by user
, but for each burst. Any help would be much appreciated!

EDIT:

Some example data would be:

user time_
0 10
0 11
2 12
0 12
2 13
2 15
0 83
0 84
0 85


so for example in the above data I would like to find (0, 10), (2, 12) and (0, 83). We can say that a burst occurs if it is within 1 hour (that would be 60 in the above example data).

Answer

If this is the only information you need:

select      user
           ,time_

from       (select      user
                       ,time_
                       ,case when time_ - lag (time_,1,time_-60) over (partition by user order by time_) >= 60 then 'Y' else null end  as burst

            from        my_table 
            ) t

where       burst = 'Y'
;

user    time_
0       10
0       83
2       12

If you'll need to gather some additional information on each burst:

select      user
           ,burst_seq

           ,min (time_) as min_time_
           ,max (time_) as max_time_
           ,count (*)   as events_num

from       (select      user
                       ,time_

                       ,count(burst) over 
                        (
                            partition by    user 
                            order by        time_  
                            rows unbounded preceding
                        ) + 1                           as burst_seq

            from       (select      user
                                   ,time_
                                   ,case when time_ - lag (time_) over (partition by user order by time_) >= 60 then 'Y' else null end as burst

                        from        my_table 
                        ) t
            ) t

group by    user
           ,burst_seq
;

user    burst_seq   min_time_   max_time_   events_num

0       1           10          12          3
0       2           83          85          3
2       1           12          15          3

P.s. There seems to be a bug with the CASE statement.
case when ... then 'Y' end yields FAILED: IndexOutOfBoundsException Index: 2, Size: 2 although it is a legal syntax.
Adding else null solved it.

Comments