enaJ - 4 months ago
MySQL Question

sql: group by multiple correlated fields (date, weekday, month)

I am working on a SQL task. The goal is to find the average number of flights per day for a given day of the week in a given month, using the flights table.

Input table:
flights

id BIGINT
dep_day_of_week varchar (255)
dep_month varchar (255)
dep_date text


An example of the flights table. There could be multiple entries for the same date.

id  dep_day_of_week  dep_month  dep_date
1   Thursday         January    4/7/2005 15:24:00
2   Friday           February   5/6/2005 12:12:12
3   Friday           February   5/6/2005 15:12:12


I found the following solution:

SELECT a.dep_month,
       a.dep_day_of_week,
       AVG(a.flight_count) AS average_flights
FROM (
    SELECT dep_month, dep_day_of_week, dep_date,
           COUNT(*) AS flight_count
    FROM flights
    GROUP BY 1,2,3
) a
GROUP BY 1,2
ORDER BY 1,2;


My question is about the subquery, which calculates the number of flights per day:

SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
FROM flights
GROUP BY 1,2,3


Since dep_month, dep_day_of_week, and dep_date are three correlated attributes, with dep_date being the most detailed of the three, I thought GROUP BY 1,2,3 would do the same thing as GROUP BY 3.

To check for possible differences, I wrapped the subquery above in count(*) from .. to count the rows it returns, once for each grouping:

Select count(*) from (
SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
FROM flights
GROUP BY 1,2,3   -- or: GROUP BY 3
) a


In the output, the counts for GROUP BY 1,2,3 and GROUP BY 3 are 447 and 441, respectively. Why is there a difference between these two grouping methods?

Updates:

Thanks to @trincot for the excellent answer. I ran his suggested query and found inconsistencies in the input data.

SELECT dep_date, count(distinct dep_month), count(distinct dep_day_of_week)
FROM flights
GROUP BY dep_date
HAVING count(distinct dep_month) > 1
OR count(distinct dep_day_of_week) > 1


Output:

dep_date    count(distinct dep_month)  count(distinct dep_day_of_week)
1/16/2001   1                          2
10/25/2003  1                          2
2/23/2000   1                          2
3/29/2001   1                          2
4/3/2001    1                          2
5/13/2000   1                          2


Specifically, the database assigns Monday to 1/16/2001 8:25:00 and Tuesday to 1/16/2001 7:56:00. That is the cause of the inconsistency.

Answer

As the date field has a time component, the count(*) in your subquery is going to be 1 nearly every time: two flights only land in the same group when their timestamps match exactly, so almost every record generates a new group. Your groups are actually per second, not per day.
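To see the effect on the sample rows from the question, here is a small sketch. It uses Python's sqlite3 module with an in-memory SQLite database as a stand-in for MySQL (an assumption made only so the example runs without a server). SQLite has no substring_index, so substr/instr is used to cut the text at the first space instead.

```python
import sqlite3

# Rows copied from the sample table in the question.
rows = [
    (1, "Thursday", "January", "4/7/2005 15:24:00"),
    (2, "Friday", "February", "5/6/2005 12:12:12"),
    (3, "Friday", "February", "5/6/2005 15:12:12"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights ("
    "id INTEGER, dep_day_of_week TEXT, dep_month TEXT, dep_date TEXT)"
)
conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", rows)

# Grouping on the full text value: one group per distinct timestamp.
per_timestamp = conn.execute(
    "SELECT dep_date, COUNT(*) FROM flights "
    "GROUP BY dep_date ORDER BY dep_date"
).fetchall()

# Grouping on the part before the space: one group per calendar date.
per_day = conn.execute(
    "SELECT substr(dep_date, 1, instr(dep_date, ' ') - 1) AS d, COUNT(*) "
    "FROM flights GROUP BY d ORDER BY d"
).fetchall()

print(per_timestamp)  # three groups, each with a count of 1
print(per_day)        # two groups; 5/6/2005 has a count of 2
```

With the full timestamp, the two flights on 5/6/2005 end up in separate groups; only after stripping the time do they count as the same day.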

You could get your results without a subquery, like this:

select   dep_month,
         dep_day_of_week,
         count(*) /
             count(distinct substring_index(dep_date, ' ', 1)) avg_flights
from     flights
group by dep_month,
         dep_day_of_week

This counts all the flight records, and divides that by the number of different dates these flights are on. The date is extracted by only taking the part before the space.

Note that this means that when you don't have a record at all for a certain date, this day will not count in the average and might give a false impression. For instance, if in January there is only one Friday for which you have flights (let's say 10 of them), but there are 4 Fridays in January, you will still get an average of 10, even though 2.5 would be more reasonable.
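The Friday example can be worked through in a short sketch. The data is hypothetical (10 flights, all on Friday 1/7/2005), SQLite via Python's sqlite3 module stands in for MySQL, and substr/instr replaces substring_index. The year 2005 is chosen only because its January happens to have four Fridays; the calendar-based average is one possible alternative, not something the query above computes.

```python
import calendar
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights ("
    "id INTEGER, dep_day_of_week TEXT, dep_month TEXT, dep_date TEXT)"
)
# Hypothetical data: 10 flights, all on the same Friday in January 2005.
conn.executemany(
    "INSERT INTO flights VALUES (?, 'Friday', 'January', ?)",
    [(i + 1, f"1/7/2005 {8 + i}:00:00") for i in range(10)],
)

# Same idea as the query above: flights divided by distinct dates seen.
# The "* 1.0" forces floating-point division in SQLite.
observed_avg = conn.execute(
    "SELECT COUNT(*) * 1.0 / "
    "COUNT(DISTINCT substr(dep_date, 1, instr(dep_date, ' ') - 1)) "
    "FROM flights GROUP BY dep_month, dep_day_of_week"
).fetchone()[0]

# Count the Fridays that actually exist in January 2005.
# monthcalendar() pads days outside the month with 0.
fridays = sum(
    1 for week in calendar.monthcalendar(2005, 1)
    if week[calendar.FRIDAY] != 0
)

total_flights = conn.execute("SELECT COUNT(*) FROM flights").fetchone()[0]
calendar_avg = total_flights / fridays

print(observed_avg)  # average over dates that have records
print(calendar_avg)  # average over all Fridays in the month
```

The first figure only reflects days that appear in the table, while the second spreads the same flights over every Friday in the month.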

About the difference in count

You state that this query returns 447 records:

Select count(*) from (
    SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
    FROM flights
    GROUP BY 1,2,3) a

And this only 441:

Select count(*) from (
    SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
    FROM flights
    GROUP BY 3) a

This seems to indicate that you have identical dates in multiple records, but with a different value in one of the first two columns, which would be an inconsistency. You can find out with this query:

SELECT   dep_date, count(distinct dep_month), count(distinct dep_day_of_week)
FROM     flights
GROUP BY dep_date
HAVING   count(distinct dep_month) > 1
    OR   count(distinct dep_day_of_week) > 1

In a healthy data set, this query should return 0 records. If it returns records, you'll get the dates for which the month is not correctly set in at least one record, or the day of the week is not correctly set in at least one record.
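A minimal sketch of how one mislabeled record changes the group counts, again using Python's sqlite3 module as a stand-in for MySQL. The four rows are hypothetical: dep_date holds only the date here, and one record for 1/16/2001 carries the wrong day of the week.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights ("
    "id INTEGER, dep_day_of_week TEXT, dep_month TEXT, dep_date TEXT)"
)
# Hypothetical rows: the same date is labeled Monday once and Tuesday once.
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        (1, "Monday", "January", "1/16/2001"),
        (2, "Tuesday", "January", "1/16/2001"),
        (3, "Wednesday", "January", "1/17/2001"),
        (4, "Wednesday", "January", "1/17/2001"),
    ],
)

def count_groups(group_by):
    # Number of rows the grouped subquery produces.
    return conn.execute(
        "SELECT COUNT(*) FROM ("
        "SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count "
        f"FROM flights GROUP BY {group_by}) a"
    ).fetchone()[0]

groups_all_three = count_groups("1,2,3")
groups_date_only = count_groups("3")

# The consistency check from the answer.
inconsistent = conn.execute(
    "SELECT dep_date, COUNT(DISTINCT dep_month), "
    "COUNT(DISTINCT dep_day_of_week) "
    "FROM flights GROUP BY dep_date "
    "HAVING COUNT(DISTINCT dep_month) > 1 "
    "OR COUNT(DISTINCT dep_day_of_week) > 1"
).fetchall()

print(groups_all_three, groups_date_only)
print(inconsistent)
```

Each inconsistent date adds one extra group to GROUP BY 1,2,3 for every additional label it carries, which matches the kind of gap seen between 447 and 441.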