K.I K.I - 1 month ago 6
SQL Question

Oracle, how to remove almost same rows from the table

Over the years a few duplicates were added to our data tables as a result of spelling mistakes. So for example someone misspelled the name and instead of O'leary wrote Oleary. Our system thinks its a totally different name and does not complain about it, however in most cases its the same contact entered twice(I didn't build this system).

Now what I want to do is remove all of these duplicates, but I am having hard time building a query to display them all. I did experiment with UTL_MATCH and wrote a query that would return all similar names if I supply the name.

select first_name from customers
where UTL_MATCH.edit_distance_similarity(first_name,'Oleary') > 60
order by first_name;


However I would like to build a query that would automatically return all possible duplicates without having to supply the name. Could anyone point me in the correct direction please?

Answer

Something like this would technically work.

select c1.first_name, c2.first_name
  from customers c1
       cross join customers c2
 where utl_match.edit_distance_similarity( c1.first_name, c2.first_name ) > 60
 order by c1.first_name

However, it would be extremely slow unless your customers table happens to be very (very) small since you're comparing every row in the customers table against every other row in the table (and your edit distance similarity cut-off is pretty low). In order to make that faster, you're probably going to have to make assumptions about your data or do something else that can be a preliminary filter. For example, if you assume that any duplicates start with the same first character or the same first few characters ignoring punctuation, then you can dramatically decrease the number of pairs that need to be matched at the risk of missing the fact that "Kustin" may be a typo duplicate of "Justin" where the first character is different. Requiring that c2.customer_id > c1.customer_id would be another reasonable filter to consider assuming that you don't need every pair duplicated (i.e. a "Kustin/ Justin" row can exist without the equivalent "Justin/ Kustin" row).