ivkovmg ivkovmg - 3 months ago 15
SQL Question

MS Access - Remove duplicates based on three columns while keeping one result without using SQL Min/Max

I have been having an issue for a little bit of time now and cannot find a solution that has worked for me. It might be that I am just not doing it correctly or that there is an alternative that will work better. I am opened to, and appreciative, of any ideas.

I have a table (tblDocQueue) in access that is like the one displayed below. The data comes from a data extract of an older application that we use at work, so the source or the extract cant be changed. We upload the data to Access to analyze and build metrics around it. The tables is as follows:

ID DocName OwnerName AccountNum DocRef

1 Doc 1 Matt 1001 Z0005638

2 Doc 1 Matt 1002 Z0005638

3 Doc 1 Tony 5010 Z0005639

4 Doc 2 Luke 1050 Z0005640

5 Doc 3 Luke 1050 Z0005641

6 Doc 3 Gary 1234 Z0005641

7 Doc 4 John 8789 Z0005642

8 Doc 5 Ed 8789 Z0005642

9 Doc 5 Ed 8790 Z0005643

10 Doc 5 Connie 4579 Z0005644

11 Doc 6 Mary 3616 Z0005645

12 Doc 6 Lucy 4795 Z0005646

13 Doc 6 Tina 4795 Z0005646

14 Doc 7 Matt 1001 Z0005638

15 Doc 7 John 8789 Z0005647


There are more columns than what are listed, but they are not relevant to the question. I am trying to remove duplicates, keeping one unique value, based on three columns (DocName, OwnerName, Doc Ref). I was doing this using the following SQL, but it began taking hours (up to 7) to process around 500,000 lines of data. I am unsure if the efficiency problem is because of using min/max or some other reason

SELECT tblDocQueue.ID AS Expr1, tblDocQueue.DocName AS Expr2,
tblDocQueue.OwnerName AS Expr3, tblDocQueue.AcctNo AS Expr4,
tblDocQueue.ExpDate AS Expr5, tblDocQueue.EffectiveDate AS Expr6,
tblDocQueue.SignatureDate AS Expr7, tblDocQueue.DocBNYSts AS Expr8,
tblDocQueue.StsDate AS Expr9, tblDocQueue.UserSts AS Expr10,
tblDocQueue.Location AS Expr11, tblDocQueue.Ackngmt AS Expr12,
tblDocQueue.OPID AS Expr13, tblDocQueue.Comments AS Expr14,
tblDocQueue.DocRef AS Expr15, tblDocQueue.ExternalComment AS Expr16,
tblDocQueue.FirstName AS Expr17, tblDocQueue.LastName AS Expr18,
tblDocQueue.ClientID AS Expr19, tblDocQueue.Address AS Expr20,
tblDocQueue.CountryCode AS Expr21
FROM tblDocQueue
WHERE ((([tblDocQueue].[ID])=(
SELECT Min(t.[ID])
FROM [tblDocQueue] AS t
WHERE t.[DocRef]=[tblDocQueue].[DocRef]
AND t.[DocName]=[tblDocQueue].[DocName])));


The time that this it was taking was unacceptable for the business team. I then developed a work around in VBA that exports the data to an excel file, using the remove duplicates function preloaded in excel, and imported the unique data back into a different table. This only takes a few seconds in excel. As the use of this database is beginning to expand, and I will removing duplicates in a similar way for hundreds of datasets a day, I am trying to get this to work without using the above workaround.

The anticipated result of the above example data would be:

ID DocName OwnerName AccountNum DocRef

1 Doc 1 Matt 1001 Z0005638

3 Doc 1 Tony 5010 Z0005639

4 Doc 2 Luke 1050 Z0005640

5 Doc 3 Luke 1050 Z0005641

6 Doc 3 Gary 1234 Z0005641

7 Doc 4 John 8789 Z0005642

8 Doc 5 Ed 8789 Z0005642

9 Doc 5 Ed 8790 Z0005643

10 Doc 5 Connie 4579 Z0005644

11 Doc 6 Mary 3616 Z0005645

12 Doc 6 Lucy 4795 Z0005646

13 Doc 6 Tina 4795 Z0005646

14 Doc 7 Matt 1001 Z0005638

15 Doc 7 John 8789 Z0005647


If anyone can help me with the SQL to:


  1. Conditionally remove duplicate values based on three columns

  2. Using Microsoft Access 2010

  3. While keeping one row of each unique value

  4. In some way that is efficient / does not take a long amount of time (tables up to 5,000,000 records)



Any and all help is greatly appreciated!

Answer

Consider a straightforward aggregate query, grouped on the columns you need distinct DocName, OwnerName, DocRef. Then, take the Min() of the ID and AccountNum which would also include all the other columns:

SELECT Min(tblDocQueue.ID) AS MinOfID, 
       tblDocQueue.DocName, 
       tblDocQueue.OwnerName, 
       Min(tblDocQueue.AccountNum) AS MinOfAccountNum, 
       tblDocQueue.DocRef
FROM tblDocQueue
GROUP BY tblDocQueue.DocName, 
         tblDocQueue.OwnerName, 
         tblDocQueue.DocRef
ORDER BY Min(tblDocQueue.ID);

Output

MinOfID   DocName   OwnerName   MinOfAccountNum DocRef
1         Doc 1     Matt        1001            Z0005638
3         Doc 1     Tony        5010            Z0005639
4         Doc 2     Luke        1050            Z0005640
5         Doc 3     Luke        1050            Z0005641
6         Doc 3     Gary        1234            Z0005641
7         Doc 4     John        8789            Z0005642
8         Doc 5     Ed          8789            Z0005642
9         Doc 5     Ed          8790            Z0005643
10        Doc 5     Connie      4579            Z0005644
11        Doc 6     Mary        3616            Z0005645
12        Doc 6     Lucy        4795            Z0005646
13        Doc 6     Tina        4795            Z0005646
14        Doc 7     Matt        1001            Z0005638
15        Doc 7     John        8789            Z0005647
Comments