mls.z mls.z - 6 days ago 6
SQL Question

Pairing Sequential Events on Postgresql

We are logging the main flows of actions our users make on our iPad app on a table. Each flow has a start(tagged Started) and an end that is either tagged Cancelled or Finished, and there shouldn't be any overlapping events.

A set of flows Started, Cancelled or Finished for a user looks like this:

user_id timestamp event_text event_num
info@cafe-test.de 2016-10-30 00:08:00.966+00 Flow Started 0
info@cafe-test.de 2016-10-30 00:08:15.58+00 Flow Cancelled 2
info@cafe-test.de 2016-10-30 00:08:15.581+00 Flow Started 0
info@cafe-test.de 2016-10-30 00:34:44.134+00 Flow Finished 1
info@cafe-test.de 2016-10-30 00:42:26.102+00 Flow Started 0
info@cafe-test.de 2016-10-30 00:42:49.276+00 Flow Cancelled 2
info@cafe-test.de 2016-10-30 00:42:49.277+00 Flow Started 0
info@cafe-test.de 2016-10-30 00:59:47.337+00 Flow Cancelled 2
info@cafe-test.de 2016-10-30 00:59:47.337+00 Flow Started 0
info@cafe-test.de 2016-10-30 00:59:47.928+00 Flow Cancelled 2


We want to calculate how long a cancelled and finished flow last on average. For this we need to pair event Started with Canceled or Finished. The following code does that, however can't work around the following data quality issue that we have:


  • When a customer wants to start a new flow(let's call it Flow2) before ending the ongoing flow (Flow1), we shoot a cancelled event as we shoot the started event for the new flow. So
    Flow1 Cancelled=Flow2 Started
    . However when we use window functions to order and lead/lag between ordered events that actually belong to different flows get matched.
    By using this code:

    WITH track_scf AS (SELECT user_id, timestamp, event_text, CASE WHEN event_text LIKE '%Started%' THEN 0 when event_text like '%Cancelled%' then 2 ELSE 1 END AS event_num FROM tracks ORDER BY 2, 4 desc ) SELECT user_id, CASE WHEN event_num=0 then timestamp end as start,CASE WHEN LEAD(event_num, 1) OVER (PARTITION BY user_id ORDER BY timestamp,event_num) <> 0 THEN LEAD(timestamp, 1) OVER (PARTITION BY user_id ORDER BY timestamp,event_num) END as end, CASE WHEN LEAD(event_num, 1) OVER (PARTITION BY user_id ORDER BY timestamp,event_num) <> 0 THEN LEAD(event_num, 1) OVER (PARTITION BY user_id ORDER BY timestamp,event_num) END as action FROM track_scf



We get this result:

user_id start end action
info@cafe-test.de 2016-10-30 00:08:00.966+00 2016-10-30 00:08:15.58+00 2
info@cafe-test.de 2016-10-30 00:08:15.581+00 2016-10-30 00:34:44.134+00 1
info@cafe-test.de 2016-10-30 00:42:26.102+00 2016-10-30 00:42:49.276+00 2
info@cafe-test.de 2016-10-30 00:42:49.277+00 NULL NULL
info@cafe-test.de 2016-10-30 00:59:47.337+00 2016-10-30 00:59:47.337+00 2
info@cafe-test.de NULL 2016-10-30 00:59:47.928+00 2


But we should get this:

user_id start end action
info@cafe-test.de 2016-10-30 00:08:00.966+00 2016-10-30 00:08:15.58+00 2
info@cafe-test.de 2016-10-30 00:08:15.581+00 2016-10-30 00:34:44.134+00 1
info@cafe-test.de 2016-10-30 00:42:26.102+00 2016-10-30 00:42:49.276+00 2
info@cafe-test.de 2016-10-30 00:42:49.277+00 2016-10-30 00:59:47.337+00 2
info@cafe-test.de 2016-10-30 00:59:47.337+00 2016-10-30 00:59:47.928+00 2


How do I need to alter the code so that the pairing is correct?

Answer
select      user_id       
           ,"start"                       
           ,"end"                         
           ,"action"

from       (select      user_id
                       ,timestamp                 as "start"
                       ,lead (event_num)   over w as "action"
                       ,lead ("timestamp") over w as "end"
                       ,event_num

            from        tracks t

            window      w as (partition by user_id order by "timestamp",event_num desc)
            ) t

where       t.event_num = 0
;
Comments