MacMac MacMac - 1 year ago 110
MySQL Question

GROUP BY does not remove duplicates

I have a watchlist system that I've coded, in the overview of the users' watchlist, they would see a list of records, however the list shows duplicates when in the database it only shows the exact, correct number.

I've tried

GROUP BY watch.watch_id
GROUP BY rec.record_id
, none of any types of group I've tried seems to remove duplicates. I'm not sure what I'm doing wrong.

SELECT watch.watch_date,
watchlist watch

LEFT OUTER JOIN records rec ON rec.record_id = watch.record_id

LEFT OUTER JOIN members usr ON rec.user_id = usr.user_id
WHERE watch.user_id = 1
GROUP BY watch.watch_id
LIMIT 0, 25

The watchlist table looks like this:

| watch_id | user_id | record_id | watch_date |
| 13 | 1 | 22 | 1314038274 |
| 14 | 1 | 25 | 1314038995 |

Answer Source

GROUP BY does not "remove duplicates". GROUP BY allows for aggregation. If all you want is to combine duplicated rows, use SELECT DISTINCT.

If you need to combine rows that are duplicate in some columns, use GROUP BY but you need to to specify what to do with the other columns. You can either omit them (by not listing them in the SELECT clause) or aggregate them (using functions like SUM, MIN, and AVG). For example:

SELECT watch.watch_id, COUNT(rec.street_number), MAX(watch.watch_date)
... GROUP by watch.watch_id


The OP asked for some clarification.

Consider the "view" -- all the data put together by the FROMs and JOINs and the WHEREs -- call that V. There are two things you might want to do.

First, you might have completely duplicate rows that you wish to combine:

a b c
- - -
1 2 3
1 2 3
3 4 5

Then simply use DISTINCT


a b c
- - -
1 2 3
3 4 5

Or, you might have partially duplicate rows that you wish to combine:

a b c
- - -
1 2 3
1 2 6
3 4 5

Those first two rows are "the same" in some sense, but clearly different in another sense (in particular, they would not be combined by SELECT DISTINCT). You have to decide how to combine them. You could discard column c as unimportant:


a b
- -
1 2
3 4

Or you could perform some kind of aggregation on them. You could add them up:

SELECT a,b, SUM(c) "tot" FROM V GROUP BY a,b;

a b tot
- - ---
1 2 9
3 4 5

You could add pick the smallest value:

SELECT a,b, MIN(c) "first" FROM V GROUP BY a,b;

a b first
- - -----
1 2 3
3 4 5

Or you could take the mean (AVG), the standard deviation (STD), and any of a bunch of other functions that take a bunch of values for c and combine them into one.

What isn't really an option is just doing nothing. If you just list the ungrouped columns, the DBMS will either throw an error (Oracle does that -- the right choice, imo) or pick one value more or less at random (MySQL). But as Dr. Peart said, "When you choose not to decide, you still have made a choice."

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download