user2783069 user2783069 - 7 months ago 10
SQL Question

understand the meaning of DATEDIFF + LAG

I need to understand what the meaning in this calculation :

DATEDIFF(days, lag(recday, 1) OVER (PARTITION BY udid
ORDER BY recday), recday)


And how can i implement it without using
lag
and
datediff
, for Amazon Redshift, which doesn't have
datediff
.

This is the full query :

SELECT udid
,recday AS day
,count(*) AS session_count
,DATEDIFF(days, lag(recday, 1) OVER (PARTITION BY udid
ORDER BY recday), recday)
AS repeat_transaction1
,DATEDIFF(days, lag(recday, 2) OVER (PARTITION BY udid
ORDER BY recday), recday)
AS repeat_transaction2
,DATEDIFF(days, lag(recday, 3) OVER (PARTITION BY udid
ORDER BY recday), recday)
AS repeat_transaction3
,DATEDIFF(days, lag(recday, 4) OVER (PARTITION BY udid
ORDER BY recday), recday)
AS repeat_transaction4
,DATEDIFF(days, lag(recday, 5) OVER (PARTITION BY udid
ORDER BY recday), recday)
AS repeat_transaction5
,DATEDIFF(days, lag(recday, 6) OVER (PARTITION BY udid
ORDER BY recday), recday)
AS repeat_transaction6
FROM vvdays


this is how my data looks -

10000001 2016-02-03 17:26:03.0
10000001 2016-02-08 21:36:07.0
10000001 2016-02-10 07:48:06.0
10000012 2016-02-06 22:06:42.0
10000012 2016-02-06 22:07:42.0
10000028 2016-02-04 13:18:48.0
10000028 2016-02-04 13:30:42.0
10000028 2016-02-04 13:30:55.0
10000028 2016-02-05 16:48:41.0
10000028 2016-02-05 16:58:34.0
10000028 2016-02-07 15:44:33.0
10000028 2016-02-07 16:29:00.0
10000039 2016-02-03 21:16:49.0
10000039 2016-02-03 21:17:50.0
10000039 2016-02-03 21:18:49.0
10000039 2016-02-03 21:19:49.0
10000039 2016-02-03 21:20:50.0
10000039 2016-02-03 21:21:50.0
10000039 2016-02-03 21:22:51.0
10000039 2016-02-03 21:23:53.0
10000039 2016-02-03 21:24:49.0
10000039 2016-02-03 21:25:50.0
10000039 2016-02-03 21:26:50.0
10000039 2016-02-03 21:27:49.0
10000039 2016-02-05 21:58:59.0
10000039 2016-02-05 21:59:58.0
10000039 2016-02-05 22:00:58.0
10000039 2016-02-05 22:01:58.0
10000039 2016-02-05 22:02:59.0
10000039 2016-02-05 22:03:58.0
10000039 2016-02-05 22:05:00.0
10000039 2016-02-05 22:05:58.0
10000039 2016-02-05 22:06:58.0

Answer

Without seeing your data, I'm guessing that your table 'vvdays' contains the two fields 'udid' and 'recday'. The LAG function is getting the second, third, fourth, fifth, sixth and seventh rows of data based upon the udid. The DATEDIFF is then comparing the first 'recday' to the other lines and returning the number of days between these two dates.

How you replicate this in redshift is another issue, you could look at using UNPIVOT to get the top 7 results into the same row and then running your DATEDIFF equivalent functions on the fields themselves.

EDIT: Ok, I've designed a really hacky way of getting this working;

Create temp table for testing purposes;

CREATE TABLE #vvdays (udid int, recday datetime)

Inserting some data EDIT: now using OP's supplied data;

VALUES 
('10000001', '2016-02-03 17:26:03.0') 
,('10000001', '2016-02-08 21:36:07.0') 
,('10000001', '2016-02-10 07:48:06.0') 
,('10000012', '2016-02-06 22:06:42.0') 
,('10000012', '2016-02-06 22:07:42.0') 
,('10000028', '2016-02-04 13:18:48.0') 
,('10000028', '2016-02-04 13:30:42.0') 
,('10000028', '2016-02-04 13:30:55.0') 
,('10000028', '2016-02-05 16:48:41.0') 
,('10000028', '2016-02-05 16:58:34.0') 
,('10000028', '2016-02-07 15:44:33.0') 
,('10000028', '2016-02-07 16:29:00.0') 
,('10000039', '2016-02-03 21:16:49.0') 
,('10000039', '2016-02-03 21:17:50.0') 
,('10000039', '2016-02-03 21:18:49.0') 
,('10000039', '2016-02-03 21:19:49.0') 
,('10000039', '2016-02-03 21:20:50.0') 
,('10000039', '2016-02-03 21:21:50.0') 
,('10000039', '2016-02-03 21:22:51.0') 
,('10000039', '2016-02-03 21:23:53.0') 
,('10000039', '2016-02-03 21:24:49.0') 
,('10000039', '2016-02-03 21:25:50.0') 
,('10000039', '2016-02-03 21:26:50.0') 
,('10000039', '2016-02-03 21:27:49.0') 
,('10000039', '2016-02-05 21:58:59.0') 
,('10000039', '2016-02-05 21:59:58.0') 
,('10000039', '2016-02-05 22:00:58.0') 
,('10000039', '2016-02-05 22:01:58.0') 
,('10000039', '2016-02-05 22:02:59.0') 
,('10000039', '2016-02-05 22:03:58.0') 
,('10000039', '2016-02-05 22:05:00.0') 
,('10000039', '2016-02-05 22:05:58.0') 
,('10000039', '2016-02-05 22:06:58.0')

Really horrible query to get this working. With the restrictions you've mentioned and my lack of Amazon specific knowledge I've done the first two values for you below. If you did it this way you're going to end up with a massive statement but it will work. I'd heavily recommend researching further to see what equivalent functions you have available to you;

SELECT day1.udid
    ,MAX(day1.recday) day1
    ,MAX(day2.recday) day2
    ,DATEDIFF(day,MAX(day2.recday),MAX(day1.recday)) DayDiff
FROM #vvdays day1
LEFT JOIN (
    SELECT a.udid
        ,MAX(a.recday) recday
    FROM #vvdays a
    LEFT JOIN (
        SELECT udid
            ,MAX(recday) recday
        FROM #vvdays
        GROUP BY udid
        ) b ON a.udid = b.udid
    WHERE a.recday <> b.recday
    GROUP BY a.udid
    ) day2 ON day1.udid = day2.udid
GROUP BY day1.udid

The reason I have used MAX in for 'day1' is to return the first date. I have used it in 'day2' at the top level purely to turn it into an aggregate field, you'll only get one result here, it's a false aggregate that just lets the GROUP BY work correctly.