Ben Ben - 4 months ago 45
SQL Question

RANK records partitioned by a column in series (Vertica SQL)

I'm trying to use the Vertica rank analytic function to create a rank column partitioned by a column, but only include records that are in a series. For example the query below produces the output below the query

select when_created, status
from tablea

when_created Status
1/1/2015 ACTIVE
3/1/2015 ACTIVE
4/1/2015 INACTIVE
4/6/2015 INACTIVE
6/7/2015 ACTIVE
10/9/2015 INACTIVE


I could modify my query to include a rank column which would produce the following output

select
when_created, status, rank() OVER (PARTITION BY status order by when_created) as rnk
from tablea

when_created Status rnk
1/1/2015 ACTIVE 1
3/1/2015 ACTIVE 2
4/1/2015 INACTIVE 1
4/6/2015 INACTIVE 2
6/7/2015 ACTIVE 3
10/9/2015 INACTIVE 3


However my goal is start over the rank when a series is broken so the desired output is:

when_created Status rnk
1/1/2015 ACTIVE 1
3/1/2015 ACTIVE 2
4/1/2015 INACTIVE 1
4/6/2015 INACTIVE 2
6/7/2015 ACTIVE 1
10/9/2015 INACTIVE 1


Is there a way to accomplish this using the RANK function or is there another way to do it in vertica sql?

Thanks,
Ben

Answer

This is a gap-and-islands problem, where the tricky part is to identify the groups to use for a row_number() calculation. One solution uses a difference of row numbers to identify the different groups:

select a.*,
       row_number() over (partition by status, seqnum - seqnum_s order by when_created) as rnk
from (select a.*,
             row_number() over (order by when_created) as seqnum,
             row_number() over (partition by status order by when_created) as seqnum_s
      from tablea a
     ) a;

The logic behind this is tricky when you first see it. I advise you to run the subquery and understand the two row_number() calculations -- and to observe that the difference is constant for the groups you are interested in.