Over the years, a few duplicates crept into our data tables as a result of spelling mistakes. For example, someone misspelled a name and wrote Oleary instead of O'leary. Our system treats it as a completely different name and doesn't complain, but in most cases it's the same contact entered twice (I didn't build this system).
Now I want to remove all of these duplicates, but I am having a hard time building a query to display them all. I did experiment with UTL_MATCH and wrote a query that returns all names similar to one I supply:
select first_name from customers
where UTL_MATCH.edit_distance_similarity(first_name,'Oleary') > 60
order by first_name;
Something like this would technically work:

select c1.first_name, c2.first_name
from customers c1
cross join customers c2
where utl_match.edit_distance_similarity( c1.first_name, c2.first_name ) > 60
order by c1.first_name;
However, it would be extremely slow unless your customers table happens to be very (very) small, since you're comparing every row in the customers table against every other row in the table (and your edit-distance similarity cut-off is pretty low). To make that faster, you're probably going to have to make assumptions about your data or apply some other preliminary filter. For example, if you assume that any duplicates start with the same first character, or the same first few characters ignoring punctuation, you can dramatically decrease the number of pairs that need to be matched, at the risk of missing the fact that "Kustin" may be a typo duplicate of "Justin", where the first character is different. Requiring that c2.customer_id > c1.customer_id would be another reasonable filter to consider, assuming that you don't need every pair duplicated (i.e. a "Kustin/Justin" row can exist without the equivalent "Justin/Kustin" row).
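To make the shape of that strategy concrete outside the database, here is a minimal Python sketch of the same idea: an edit-distance similarity score (approximating UTL_MATCH.EDIT_DISTANCE_SIMILARITY as `100 * (1 - levenshtein / max_len)`, which may round differently than Oracle does), a blocking filter on the first letter ignoring leading punctuation, and id ordering so each pair is emitted once. The function and data names are made up for illustration.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (O(len(a)*len(b)))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> int:
    """Approximates UTL_MATCH.EDIT_DISTANCE_SIMILARITY: 0-100, higher = more alike."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - edit_distance(a, b) / longest))


def likely_duplicates(customers, threshold=60):
    """customers: list of (customer_id, first_name) tuples.

    Applies both filters from the answer: candidates must share a first
    letter (ignoring leading punctuation), and each pair is compared only
    once, in ascending customer_id order.
    """
    buckets = defaultdict(list)
    for cid, name in customers:
        key = name.lstrip("'\" .-")[:1].upper()  # blocking key: first real letter
        buckets[key].append((cid, name))
    pairs = []
    for group in buckets.values():
        group.sort()  # ascending id, so inner loop gives id2 > id1
        for i, (id1, n1) in enumerate(group):
            for id2, n2 in group[i + 1:]:
                if similarity(n1, n2) > threshold:
                    pairs.append((n1, n2))
    return pairs
```

With the bucketing in place, only names sharing a first letter are ever compared, so "O'leary"/"Oleary" is found while "Justin"/"Kustin" would be missed, exactly the trade-off described above.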