I want to filter out duplicate customer names from a database. A single customer may have more than one entry in the system, with the same name spelled slightly differently each time. For example, a customer named Brook may have three entries in the system
with these variations:
- Brook Berta
- Bruck Berta
- Biruk Berta
Let's assume the full name is stored in a single database column.
I would like to know the different mechanisms for identifying such duplicates in, say, 100,000 records. We could use regular expressions in C# to iterate through all the records, use some other pattern-matching technique, or export the records to whatever best fits such queries (for example, SQL with regular expression capabilities).
This is what I have in mind as a solution:
- Write C# code to iterate through each record
- Extract only the consonants, in order (in the above case: BrKBrt)
- Search the other records for the same consonant pattern, treating
similar-sounding letters such as (C, K), (C, S) and (F, PH) as equivalent
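The steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution: the names `NameDeduper`, `ConsonantKey` and `FindDuplicateGroups` are made up for this example, and the rule for C (S before E, I or Y, otherwise K) is an assumption about how the (C, K) and (C, S) pairs should be resolved. Characters outside A-Z are treated as word separators, so accented names would need extra handling. Grouping by key avoids comparing every record against every other record, which matters at 100,000 rows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class NameDeduper
{
    // Builds an uppercase consonant-only key, e.g. "Brook Berta" -> "BRKBRT".
    public static string ConsonantKey(string name)
    {
        // Treat PH as F before looking at single letters.
        string s = name.ToUpperInvariant().Replace("PH", "F");
        var key = new StringBuilder();
        char previous = '\0';

        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (c < 'A' || c > 'Z')
            {
                previous = '\0'; // spaces and punctuation separate words
                continue;
            }

            if (c == 'C')
            {
                // Assumed rule: C sounds like S before E, I, Y and like K otherwise.
                char next = i + 1 < s.Length ? s[i + 1] : '\0';
                c = (next == 'E' || next == 'I' || next == 'Y') ? 'S' : 'K';
            }

            // Skip vowels, and skip a letter that repeats the one just before it
            // (so "CK" and "KK" both contribute a single K).
            bool isVowel = "AEIOU".IndexOf(c) >= 0;
            if (!isVowel && c != previous)
            {
                key.Append(c);
            }
            previous = c;
        }
        return key.ToString();
    }

    // Returns only the groups of names that share a key with at least one other name.
    public static List<List<string>> FindDuplicateGroups(IEnumerable<string> names)
    {
        return names
            .GroupBy(ConsonantKey)
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
    }
}
```

With this sketch, the three spellings above all produce the key `BRKBRT` and end up in the same group. A key this coarse will also group names that are genuinely different, so the groups are better treated as candidates for manual review than as confirmed duplicates.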
Any other ideas or suggestions would be appreciated.