Jason Zhu Jason Zhu - 1 day ago 4
SQL Question

split row into multiple sessions by time difference in pyspark

Here's the pseudo data:

user ts
--------
1 1
1 3
1 10
1 13
1 21
1 24


It will be split into two sessions if adjacent time difference is >= 6 for each user. So, the above data should be split as below:

user ts diff
-------------------
1 1 None
1 3 2
-------------------
1 10 7
1 13 3
-------------------
1 21 8
1 24 3


I understand how to generate the diff column in
pyspark
via Window function illustrated below, but how could I split it into different sessions for each user in
pyspark
manner? Great thanks!

select
user,
ts,
(lag(ts, 1) over (partition by user order by ts asc)) as diff
from user_event

Answer

You have the right beginning. The SQL would continue as:

select user, ts, diff,
       sum(case when diff > 6 then 1 else 0 end) over (partition by user order by ts) as session_grouping
from (select user, ts,
             lag(ts, 1) over (partition by user order by ts asc) as diff
      from user_event
     ) ue;
Comments