nonegiven72 nonegiven72 - 6 months ago 17
Ruby Question

Removing duplicates from Rails postgres database

I have a local PG database that was created using a Rails application. It has 600k records, of which ~200k are duplicates. I want to keep only 1 of the record and delete the duplicates. I write SQL everyday for work but Rails is my hobby and thus I struggle with Active Record.

Could you please tell me how to manipulate this so that it removes the duplicates?

Here's how I found the duplicates (in Rails console):

Summary.select(:map_id).group(:map_id).having("count(*) > 1")


I don't think I can simply add "destoy_all" to the end of that statement as it will destroy all instances of that entry, including the duplicate values.

Answer

This will destroy the duplicates in waves, selecting only a single duplicate per map_id, on each pass. The loop will automatically finish when no more duplicates exist.

loop do
  duplicates = Summary.select("MAX(id) as id, map_id").group(:map_id).having("count(*) > 1")
  break if duplicates.length == 0
  duplicates.destroy_all
end

If the database looks like this:

| id | map_id |
|  1 |    235 |
|  2 |    299 |
|  3 |    324 |
|  4 |    235 |
|  5 |    235 |
|  6 |    299 |
|  7 |    235 |
|  8 |    324 |
|  9 |    299 |

In the first wave, these records would be returned and destroyed:

| id | map_id |
|  7 |    235 |
|  8 |    324 |
|  9 |    299 |

In the second wave, this record would be returned and destroyed:

| id | map_id |
|  5 |    235 |
|  6 |    299 |

The third wave would return and destroy this record:

| id | map_id |
|  4 |    235 |

The fourth wave would complete the process. Unless there are numerous duplicates for a given map_id, it's likely that this process will finish in single-digit loop iterations.

Given the approach, only duplicates will ever be returned, and only the newer duplicates will be removed. To remove older duplicates, instead, the query can be changed to this:

  duplicates = Summary.select("MIN(id) as id, map_id").group(:map_id).having("count(*) > 1")

In that case, wave 1 would return and destroy:

| id | map_id |
|  1 |    235 |
|  2 |    299 |
|  3 |    324 |

Wave 2 would return and destroy:

| id | map_id |
|  4 |    235 |
|  6 |    299 |

Wave 3 would return and destroy:

| id | map_id |
|  5 |    235 |

Wave 4 would complete the process.

Comments