xpil - 1 year ago 68

SQL Question

*TL;DR: scroll down to TASK 2.*

I am dealing with the following data set:

`email,createdby,createdon`

a@b.c,jsmith,2016-10-10

a@b.c,nsmythe,2016-09-09

a@b.c,vstark,2016-11-11

b@x.y,ajohnson,2015-02-03

b@x.y,elear,2015-01-01

...

and so on. Each email is guaranteed to have at least one duplicate in the data set.

Now, there are two tasks to resolve; I resolved one of them but am struggling with the other one. I will now present both tasks for completeness.

For each row, for each email, return an additional column with the name of the user that created the first record with this email.

Expected result for the above sample data set:

`email,createdby,createdon,original_createdby`

a@b.c,jsmith,2016-10-10,nsmythe

a@b.c,nsmythe,2016-09-09,nsmythe

a@b.c,vstark,2016-11-11,nsmythe

b@x.y,ajohnson,2015-02-03,elear

b@x.y,elear,2015-01-01,elear

Code to get the above:

`;WITH q0 -- this is just a security measure in case there are unique emails in the data set`

AS ( SELECT t.email

FROM t

GROUP BY t.email

HAVING COUNT(*) > 1) ,

q1

AS ( SELECT q0.email

, createdon

, createdby

, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn

FROM t

JOIN q0

ON t.email = q0.email)

SELECT q1.email

, q1.createdon

, q1.createdby

, LAG(q1.createdby, q1.rn - 1) OVER ( ORDER BY q1.email, q1.createdon ) original_createdby

FROM q1

ORDER BY q1.email

, q1.rn

Brief explanation: I partition data set by email, then I number rows in each partition ordered by creation date, finally I return [createdby] value from (rn-1)th record. Works exactly as expected.

Now, similar to the above, there is TASK 2:

For each row, for each email, return name of the user that created the first duplicate. I.e. name of a user where rn=2.

Expected result:

`email,createdby,createdon,first_dupl_createdby`

a@b.c,jsmith,2016-10-10,jsmith

a@b.c,nsmythe,2016-09-09,jsmith

a@b.c,vstark,2016-11-11,jsmith

b@x.y,ajohnson,2015-02-03,ajohnson

b@x.y,elear,2015-01-01,ajohnson

I want to keep things performant so trying to employ LEAD-LAG functions:

`WITH q0`

AS ( SELECT t.email

FROM t

GROUP BY t.email

HAVING COUNT(*) > 1) ,

q1

AS ( SELECT q0.email

, createdon

, createdby

, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn

FROM t

JOIN q0

ON t.email = q0.email)

SELECT q1.email

, q1.createdon

, q1.createdby

, q1.rn

, CASE q1.rn

WHEN 1 THEN LEAD(q1.createdby, 1) OVER ( ORDER BY q1.email, q1.createdon )

ELSE LAG(q1.createdby, q1.rn - 2) OVER ( ORDER BY q1.email, q1.createdon )

END AS first_dupl_createdby

FROM q1

ORDER BY q1.email

, q1.rn

Explanation: for the first record in each partition, return [createdby] from the following record (i.e. from the record containing the first duplicate). For all other records in the same partition return [createdby] from (rn-2) records ago (i.e. for rn = 2 we're staying on the same record, for rn = 3 we're going 1 record back, for rn = 4 - 2 records back and so on).

An issue comes up on the

`ELSE LAG(q1.createdby, q1.rn - 2)`

operation. Apparently, against any logic, despite the existence of the preceding line (WHEN 1 THEN...), the ELSE block is also evaluated for rn = 1, resulting in a negative offset value passed to the LAG function:

Offset parameter for Lag and Lead functions cannot be a negative value.

When I comment out that ELSE line, the whole thing works fine but obviously I am not getting any results in the first_dupl_createdby column for rn > 1.

QUESTION:

Is there any way of re-writing the above CASE statement (in TASK #2) so that it always returns the value from a record where rn = 2 within each partition but - and this is important bit - without doing a self-JOIN operation (I know I could prepare rows where rn = 2 in a separate sub-query but this would mean extra scans on the whole table and also running an unnecessary self-JOIN).

Answer Source

I think you can simply use the `max`

window function as you are trying to get the value from rownumber = 2 for each partition.

```
SELECT q1.email
, q1.createdon
, q1.createdby
, q1.rn
, max(case when rn=2 then q1.createdby end) over(partition by q1.email) first_dup_created_by
FROM q1
ORDER BY q1.email, q1.rn
```

You can use a similar query to get the results for rownumber=1 for the 1st scenario as well.