Aren Cambre Aren Cambre - 3 months ago 15
SQL Question

Add number to rows based on identical values in selected columns

I have a PostgreSQL database that contains traffic tickets written by a few jurisdictions.

Some jurisdictions don't indicate if multiple tickets are written in one traffic stop. However, that can be inferred by analyzing other fields. Consider this data:

ticket_id timestamp drivers_license
----------------------------------------------
1 2008-08-07 01:51:00 11111111
2 2008-08-07 01:51:00 11111111
3 2008-08-07 02:02:00 22222222
4 2008-08-07 02:25:00 33333333
5 2008-08-07 04:23:00 44444444
6 2008-08-07 04:23:00 55555555
7 2008-08-07 04:23:00 44444444


I can infer that:


  • Tickets 1 & 2 were written in a single traffic stop because they share driver's license numbers and timestamps.

  • Same for 5 & 7, but notice how ticket 6 is between them. Perhaps another officer was writing a ticket at the same time somewhere else, or data entry operators entered stuff out of order.



I would like to add another column that has a unique ID for each traffic stop. It will not be a primary key for the table because it will have duplicate values. For example:

ticket_id timestamp drivers_license stop_id
--------------------------------------------------------
1 2008-08-07 01:51:00 11111111 1
2 2008-08-07 01:51:00 11111111 1
3 2008-08-07 02:02:00 22222222 2
4 2008-08-07 02:25:00 33333333 3
5 2008-08-07 04:23:00 44444444 4
6 2008-08-07 04:23:00 55555555 5
7 2008-08-07 04:23:00 44444444 4


I can think of computationally-intensive, greedy algorithm ways of doing this with C#, but is there an efficient SQL query that can work?

Answer

Efficient SQL Query FTW!

I'm not at a computer that I can test this on so there is likely some syntax problems; I will fix in the morning, but it is something like this:

WITH uniquez as (SELECT timestamp, drivers_license, 
rank() over (ORDER BY timestamp, drivers_license) as counterz 
FROM ticketTable)

UPDATE ticketTable TT
SET stop_id = uniquez.counterz
WHERE uniquez.timestamp = TT.timestamp
AND uniquez.drivers_license = TT.drivers_license

Basically, you make a select that groups (partitions) by timestamp and drivers_license and have a row counter that goes with it. When you do the update, you use this previous select table's row counter as your "stop_id" and updates the columns that match the timestamp and drivers license.