JNYRanger JNYRanger - 3 months ago 8
SQL Question

Query to Select a "More Complete" Duplicate Record

I'm trying to write a SQL query that will select a "more complete" duplicate. For example if our table contains the following 4 fields: Name, Phone, Email, CompanyId, and both Name & CompanyId are equal for two records, but one of them contains a Phone and/or Email and the other record has these two fields blank, then only the completed record will be available in the recordset.

I thought I had the query down, but if a "more complete" record doesn't exist then that record is either left out completely or remained duplicated in the recordset (depending on how I modify the query). I'm not sure if there's a better way of doing this that will not cause any data loss/duplication. Right now all I can think of is to

UNION
a bunch more other queries, but I feel like there has to be a better way. I'm using a CTE to get the raw data and then here's what I'm trying to do to manipulate it to remove duplicates:

;WITH Contacts AS (
-- LONG QUERY HERE THAT UNIONS NECESSARY Tables & XML Columns AND returns a recordset with Name, Email, Phone, and ClientId
)
SELECT u.* FROM Contacts u
INNER JOIN (
SELECT Name, ClientId, Count(*) AS ct FROM Contacts
GROUP BY Name, ClientId
HAVING COUNT(*) > 1
) AS g
ON u.Name = g.Name AND u.ClientId = g.ClientId
WHERE Phone IS NOT NULL AND Email IS NOT NULL
UNION
SELECT u.* FROM Contacts u
INNER JOIN (
SELECT Name, ClientId, Count(*) AS ct FROM Contacts
GROUP BY Name, ClientId
HAVING COUNT(*) = 1
) AS h
ON u.Name = h.Name AND u.ClientId = h.ClientId


If unioning a few more queries is what is needed that's not a big deal, but it seems like there's probably a better way of handling this. Any suggestions?

EDIT: Sample Data

Name | Phone | Email | ClientId
------------------------------------------------------
Person1 | NULL | NULL | 42
Person1 | 555-555-5555 | test@blah.org | 42
Person2 | NULL | NULL | 21
Person2 | NULL | NULL | 21
Person3 | 555-555-5555 | NULL | 79
Person3 | NULL | NULL | 79
Person4 | 555-555-5555 | NULL | 49
Person4 | NULL | test@blah.org | 49
Person5 | 555-555-5555 | NULL | 91
Person5 | 555-555-5555 | test@blah.org | 91
Person6 | 555-555-5555 | NULL | 91


Returned Dataset After Query --

Name | Phone | Email | ClientId
------------------------------------------------------
Person1 | 555-555-5555 | test@blah.org | 42
Person2 | NULL | NULL | 21
Person3 | 555-555-5555 | NULL | 79
Person4 | 555-555-5555 | test@blah.org | 49
Person5 | 555-555-5555 | test@blah.org | 91
Person6 | 555-555-5555 | NULL | 91


The merging of data for Person4 is the ideal situation, but not necessarily the answer I'm looking for in this SO question. In that one situation where name OR email are populated I'd be OK with duplicates as long as there isn't any data loss.

Answer

This will work for sure..

select 
name,max(phone),max(email),clientid
from
Table t
group by name,clientid
Comments