TrolliOlli TrolliOlli - 8 months ago 35
MySQL Question

Managing Names in Mysql when they can be slightly different (e.g. "Matt" vs "Mathew")

I have a program that pulls sports data from multiple sources and congregates it in one central location.

I'm using mysql to manage the data, and I enter a lot of the data using load commands with .csv files.

I'm running into problems however when different sources switch between common spellings of the same name, like "Matthew" and "Matt", or "Michael" and "Mike".

I was thinking of perhaps having my program go through the database after it's loaded and manually change every name to some standard form, such as firstInitial.LastName.Team, so "Matt Johnson" on the team XYZ would become "M.Johnson.XYZ". This should work, but it seems hacky and leaves me open for cases where there's two players with the same firstInitial.LastName combo on the same team (unlikely, but still).

Just figured I'd see if any of you have other ideas.

Answer Source

You need to add some comparison logic for names that allows for aliases. The precise organization would depend on what operations you wish to implement.

Here are two possible examples:

Check a name against known aliases for a specific person

Here, you narrow down the search to one or more persons by search terms (surname, team, years etc), then check which one's name your version matches.

A possible alias-handling interface here is:

check_name(test_name,suspected_person_id) -> boolean

Possible implementation:

create table aliases
alias varchar(max),
person_id int foreign key references persons;

The function would be checking test against rows with the appropriate person_id. Do include handling for a case where multiple persons match (i.e. an ambiguity) - at least, make the code throw an exception until you actually encounter such a case in practice and can decide what to do with it.

Guess "reference" name by a possible alias

Here, you need some reference database of alternate forms for given names, with whatever structure it requires.

Your function will be querying that database, then check if any names stored in your DB is in the result. The interface is almost the same, the difference is you don't need to store aliases yourself:

guess_name(test_name,possible_candidates) -> person_id or multiple ids/error if ambiguous