user2075215 user2075215 - 7 months ago 15
SQL Question

SQL where two fields are similar using "GROUP BY"

I have a database of names, some of the names haven't been insert in the correct fashion. SecondName has sometimes been entered as FirstName.

+-----------------+--------------+
| FirstName | SecondName |
+-----------------+--------------+
| Bob | Smith |
| Gary | Rose |
| John | Jones |
| Smith | Bob |
| Gary | Oberstein |
| Adam | Sorbet |
| Jones | John |
+-----------------+--------------+


I've tried different grouping queries

select `FirstName`
, `SecondName`
from `names`
where ( `FirstName`
, `SecondName` )
in ( select `FirstName`
, `SecondName`
from `names`
group
by `FirstName`
, `SecondName`
having count(*) > 1
)


But I can't get anything to produce

+-----------------+--------------+---------+
| FirstName | SecondName | Count |
+-----------------+--------------+---------+
| Bob | Smith | 2 |
| John | Jones | 2 |
+-----------------+--------------+---------+

Answer

There is a trick to do this, you need to normalize your names, a quick way to do this is if you alphabetize first name and last name then group on the result.

SELECT name_normalized, count(*) as c
FROM (
  SELECT CASE WHEN FIRSTNAME < LASTNAME THEN FIRSTNAME||LASTNAME
              ELSE LASTNAME|| FIRSTNAME END as name_normalized
  FROM names
) X
GROUP BY name_normalized

Notes:

  • This is the simple case, you could add the normalized result as a column if you want to see the original values.
  • You may need other normalization -- it depends on what your rules are. For example UPPER() to ignore case and TRIM() to remove whitespace.
  • You can add or ignore other columns as is required for matching normalization -- Birthday, Middle Initial etc.
  • Oten time a hash on the normalized string is faster to work with than the string -- your data model might require one or the other.
Comments