In a database containing several hundreds of thousands of names (separated on firstnames and lastnames), is there a way to perform analysis to identify uncommon names? The purposes is to produce test data that represents a the full width of ways names can be written.
I guess there's many ways to define uncommon, so what I'm looking for here are both ideas for what to look for, and how to find them. For instance, have people included their title in their name? Non-ascii characters? Leading whitespace? Do people have several first names? Last names? etc.
You can use the below (informal) query and keep extend/Customize it based your need,
SELECT *
FROM ( SELECT 'LeadingWhiteSpaces' AS Flaw ,
*
FROM Emp
WHERE FirstName <> LTRIM(RTRIM(FirstName))
UNION
SELECT 'Non-ASCII' AS Flaw ,
*
FROM Emp
WHERE FirstName LIKE '%[^a-z ,-.^0-9]%'
UNION
SELECT 'TitlesInName' AS Flaw ,
*
FROM Emp
WHERE LTRIM(RTRIM(FirstName)) LIKE 'MR.%'
OR LTRIM(RTRIM(FirstName)) LIKE 'MRs.%'
OR LTRIM(RTRIM(FirstName)) LIKE 'Jr.%'
UNION
SELECT 'SeveralFirstNames' AS Flaw ,
*
FROM Emp
WHERE LTRIM(RTRIM(FirstName)) LIKE '% %'
) A
ORDER BY Flaw