zam6ak - 4 months ago
SQL Question

How to include missing data for multiple groupings within the time span?

I have the query below, which counts studies grouped by teacher, study year-month, and room for the past 12 months (including the current month). The result is correct, but I would also like to include rows with a count of zero wherever data is missing.

I looked at several related posts but could not get the desired output.



Here is the query:

SELECT
upper(trim(t.full_name)) AS teacher
, date_trunc('month', s.study_dt)::date AS study_month
, r.room_code AS room
, COUNT(1) AS study_count
FROM
studies AS s
LEFT OUTER JOIN rooms AS r ON r.id = s.room_id
LEFT OUTER JOIN teacher_contacts AS tc ON tc.id = s.teacher_contact_id
LEFT OUTER JOIN teachers AS t ON t.id = tc.teacher_id
WHERE
s.study_dt BETWEEN now() - interval '13 month' AND now()
AND s.study_dt IS NOT NULL
GROUP BY
teacher
, study_month
, room
ORDER BY
teacher
, study_month
, room;


The output I get:

"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-12-01","A2",1
"DOE, JOHN","2016-01-01","B1",1
"SIMPSON, HOMER","2016-05-01","B2",3
"MOUSE, MICKEY","2015-08-01","A2",1
"MOUSE, MICKEY","2015-11-01","B1",1
"MOUSE, MICKEY","2015-11-01","B2",2


But I want a count of 0 to show for all missing year-month and room combinations. For example (first rows only; there are 4 rooms in all: A1, A2, B1, B2):

"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-07-01","A2",0
"DOE, JOHN","2015-07-01","B1",0
"DOE, JOHN","2015-07-01","B2",0
...
"DOE, JOHN","2015-12-01","A1",1
"DOE, JOHN","2015-12-01","A2",0
"DOE, JOHN","2015-12-01","B1",0
"DOE, JOHN","2015-12-01","B2",0
...


To get the missing year-months, I tried a LEFT OUTER JOIN against a time series, joining on time_range.year_month = study_month, but it didn't work.

SELECT date_trunc('month', time_range)::date AS year_month
FROM generate_series(now() - interval '13 month', now() ,'1 month') AS time_range


So, I'd like to know how to 'fill in the gaps' for

a) both year-month and room and, as a bonus:
b) just a year-month.

The reason for this is that the dataset will be fed to a pivot library so that we can get output similar to the following (I could not do this in PG directly):

teacher,room,2015-07,...,2015-12,...,2016-07,total
"DOE, JOHN",A1,1,...,1,...,0,2
"DOE, JOHN",A2,0,...,0,...,0,0
...and so on...

Answer

Based on some assumptions (ambiguities in the question) I suggest:

SELECT t.teacher
     , m.study_month
     , r.room_code      AS room
     , count(s.room_id) AS study_count

FROM  (SELECT id, upper(trim(full_name)) AS teacher FROM teachers) t
CROSS  JOIN generate_series(date_trunc('month', now() - interval '12 month')  -- 12!
                          , date_trunc('month', now())
                          , interval '1 month') m(study_month)
CROSS  JOIN (SELECT id, room_code FROM rooms) r

LEFT   JOIN (     studies          s                                   -- parentheses!
             JOIN teacher_contacts tc ON tc.id = s.teacher_contact_id  -- INNER JOIN!
            ) ON tc.teacher_id = t.id
             AND s.study_dt >= m.study_month
             AND s.study_dt <  m.study_month + interval '1 month'      -- sargable!
             AND s.room_id = r.id
GROUP  BY t.teacher, t.id, m.study_month, r.room_code
ORDER  BY t.teacher, t.id, m.study_month, r.room_code;
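
The bonus part (b) is open to interpretation. One possible reading, and this is an assumption, is a count per teacher and month across all rooms. Under that reading, a sketch would drop rooms from the grid and from the join condition. It reuses the table and column names from the question and counts s.teacher_contact_id, which is non-null in every matched row because of the inner join:

```sql
-- Sketch for (b), ASSUMING "just a year-month" means one row
-- per teacher and month, with studies counted across all rooms.
SELECT t.teacher
     , m.study_month::date         AS study_month
     , count(s.teacher_contact_id) AS study_count  -- 0 where nothing matches

FROM  (SELECT id, upper(trim(full_name)) AS teacher FROM teachers) t
CROSS  JOIN generate_series(date_trunc('month', now() - interval '12 month')
                          , date_trunc('month', now())
                          , interval '1 month') m(study_month)

LEFT   JOIN (     studies          s
             JOIN teacher_contacts tc ON tc.id = s.teacher_contact_id
            ) ON tc.teacher_id = t.id
             AND s.study_dt >= m.study_month
             AND s.study_dt <  m.study_month + interval '1 month'
GROUP  BY t.teacher, t.id, m.study_month
ORDER  BY t.teacher, t.id, m.study_month;
```

If (b) instead means keeping the room column but only for rooms a teacher actually used, the grid would need a different right-hand factor, which is not shown here.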

Major points

  • Build a grid of all desired combinations with CROSS JOIN, then LEFT JOIN to existing rows.

  • In your case, it's a join of several tables, so I use parentheses in the FROM list to LEFT JOIN to the result of INNER JOIN within the parentheses. It would be incorrect to LEFT JOIN to each table separately, because you would include hits on partial matches and get potentially incorrect counts.

  • Assuming referential integrity and working with PK columns directly, we don't need to join rooms and teachers a second time on the right side of the LEFT JOIN. But we still have a join of two tables (studies and teacher_contacts). The role of teacher_contacts is unclear to me. Normally, I would expect a relationship between studies and teachers directly. Might be further simplified ...

  • We need to count a column from the right side of the LEFT JOIN that is non-null in every matched row, like count(s.room_id). count(*) would also count the grid row itself and never return 0.

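  • To keep this fast for big tables, make sure your predicates are sargable: compare s.study_dt against a range instead of wrapping it in date_trunc(), so that an index on study_dt can be used.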

  • The column teacher is not reliably unique. Operate with a unique ID, preferably the PK (faster and simpler, too). I am still using teacher for the output to match your desired result. It might be wise to include a unique ID in the output as well, since names can be duplicated.

  • You want:

    the past 12 months (including current month).

    So start with date_trunc('month', now() - interval '12 month') (not 13). That's rounding down the start already and does what you want - more accurately than your original query.
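
The point about counting a column rather than rows can be checked in isolation. The following is a minimal sketch with made-up inline data (the CTE named studies and its three rows are hypothetical and stand in for the real table). It joins a three-month series to the data with the same range predicate as above and shows both counts side by side:

```sql
-- Hypothetical sample data: two studies in January, none in February,
-- one in March.
WITH studies(room_id, study_dt) AS (
   VALUES (1, timestamp '2016-01-10 09:00')
        , (1, timestamp '2016-01-20 14:30')
        , (2, timestamp '2016-03-05 11:15')
   )
SELECT m.study_month::date AS study_month
     , count(*)            AS count_star    -- counts the grid row itself
     , count(s.room_id)    AS count_column  -- counts matched rows only
FROM   generate_series(timestamp '2016-01-01'
                     , timestamp '2016-03-01'
                     , interval  '1 month') m(study_month)
LEFT   JOIN studies s ON s.study_dt >= m.study_month
                     AND s.study_dt <  m.study_month + interval '1 month'
GROUP  BY m.study_month
ORDER  BY m.study_month;
```

For the February row, count_star is 1 (the unmatched grid row padded with NULLs) while count_column is 0, which is the value wanted for a gap.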


About your closing remark:

the dataset would be fed to a pivot library ... (could not do this in PG directly)

Chances are you can use the two-parameter form of crosstab() to produce your desired result directly and with excellent performance, in which case the above query is not needed to begin with.
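
As a rough sketch of what that could look like, assuming the tablefunc extension can be installed: the inline VALUES rows below are hypothetical and stand in for the aggregated counts, and the month range is hard-coded to seven months for brevity. The first column is a synthetic row name built from teacher and room, because crosstab() groups input rows by that column alone; teacher and room are carried along as "extra" columns.

```sql
CREATE EXTENSION IF NOT EXISTS tablefunc;  -- provides crosstab()

SELECT *
FROM   crosstab(
   $$
   SELECT teacher || ' / ' || room AS row_name  -- unique per output row
        , teacher, room                         -- "extra" columns
        , study_month, study_count              -- category, value
   FROM  (VALUES
            ('DOE, JOHN', 'A1', date '2015-07-01', 1)
          , ('DOE, JOHN', 'A1', date '2015-12-01', 1)
          , ('DOE, JOHN', 'B1', date '2016-01-01', 1)
         ) v(teacher, room, study_month, study_count)
   ORDER  BY 1
   $$
 , $$
   SELECT m::date
   FROM   generate_series(timestamp '2015-07-01'
                        , timestamp '2016-01-01'
                        , interval  '1 month') m
   $$
   ) AS ct (row_name text, teacher text, room text
          , "2015-07" int, "2015-08" int, "2015-09" int, "2015-10" int
          , "2015-11" int, "2015-12" int, "2016-01" int);
```

The output column list has to be spelled out and must contain one column per category returned by the second query. Months without data come back as NULL rather than 0, so wrap the month columns in COALESCE(..., 0) if zeros are required. Teacher and room combinations with no rows at all in the source query do not appear in the result.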
