John Baum John Baum - 1 month ago 8
SQL Question

Using group by and self join to find duplicates

I have a table that i am attempting to query which looks like this:

t_documents
id
user_id
submitted_date
text
status


A user can have multiple documents in the documents table and is free to submit the same text multiple times. I wanted a way to see how many duplicate submissions are made per user. So for ex:

VALUES (1, 1234, 2016-07-05, "this is a test", 3)
VALUES (2, 1234, 2016-07-06, "this is a test", 3)
VALUES (3, 5678, 2016-07-07, "this is another test", 3)
VALUES (4, 5678, 2016-07-08, "this is another test", 3)


For the above set of data, I wanted a result that says give me a record for user 1234, the duplicate text and the number of times the duplicate text has been submitted. I have tried the following:

select oring.user_id, orig.text, COUNT(1) as dups
from t_documents orig
join t_documents another
on orig.user_id = another.user_id
and orig.text = another.text
group by user_id


The above is super rough and does not work. Can anyone please advise how to do what I want? Another query I would be interested in would be, in total, how many duplicate entries are there across all the users?

Answer

I'm not sure that you need a self-join here. A simple GROUP BY on the user ID and text columns should suffice:

SELECT user_id, COUNT(*) AS dup_count
FROM t_documents
GROUP BY user_id, text

I assume here that you don't care about the posting date when determining whether a text be duplicate or not.

Edit:

If you want to find the number of duplicate texts which were posted, by any user at any time, then you can try the following query:

SELECT COUNT(*) - COUNT(DISTINCT text)
FROM t_documents
GROUP BY text
Comments