Tobbe Tobbe - 5 months ago 12
SQL Question

Duplicate rows when joining three tables

I'm using SQL Server 2014, and I've got a problem with a query. I've got three tables. A

Report
consists of ten each of
ClothingObservation
and
HygieneObservation
. The way I do this is by referencing the
ReportId
of
Report
in ten rows each of the two types of observations, for 20 observations per report in total. I want to select all the rows of one report. When I try to do this, I get 100 rows. My goal is to get 10 rows, or 20 rows with
NULL
values. This is for testing purposes at the moment, so
Report
contains just 1 row, and
ClothingObservation
and
HygieneObservation
contains 10 rows each, all referencing the
ReportId
of the one existing report.

My tables, details omitted for clarity:

CREATE TABLE HygieneObservation
(
HygieneObservationId int PRIMARY KEY IDENTITY NOT NULL,
...
ReportId int NOT NULL
)

CREATE TABLE ClothingObservation
(
ClothingObservationId int PRIMARY KEY IDENTITY NOT NULL,
...
ReportId int NOT NULL
)

CREATE TABLE Report
(
ReportId int PRIMARY KEY IDENTITY NOT NULL,
Period Date NOT NULL,
Reporter nvarchar(8) NOT NULL,
DepartmentId int NOT NULL
)


My query:

SELECT
Report.ReportId,
Report.Period,
Report.Reporter,
Report.DepartmentId,

ClothingObservation.ClothingObservationId,
HygieneObservation.HygieneObservationId

FROM Report

LEFT JOIN ClothingObservation ON
(ClothingObservation.ReportId = Report.ReportId)
LEFT JOIN HygieneObservation ON
(HygieneObservation.ReportId = Report.ReportId)

GROUP BY
Report.ReportId,
Period,
Reporter,
DepartmentId,

ClothingObservation.ClothingObservationId,
HygieneObservation.HygieneObservationId


This gives me 100 rows, which I understand is because each row in
ClothingObservation
is matched to each row in
HygieneObservation
. I thought that using
GROUP BY
would cause duplicates to be removed, but I'm obviously doing something wrong. Any hints?

Edit: Here's my data right now (details omitted).

Report:


ReportId Period Reporter DepartmentId
----------- ---------- -------- ------------
1 2016-05-01 username 1


ClothingObservation:


ClothingObservationId ... ReportId
--------------------- ... -----------
1 ... 1
2 ... 1
3 ... 1
4 ... 1
5 ... 1
6 ... 1
7 ... 1
8 ... 1
9 ... 1
10 ... 1


HygieneObservation:


HygieneObservationId ... ReportId
-------------------- ... -----------
3 ... 1
4 ... 1
5 ... 1
6 ... 1
7 ... 1
8 ... 1
9 ... 1
10 ... 1
12 ... 1
13 ... 1


Edit 2: If I run these two queries, I get my desired output (again, irrelevant details omitted from result):

SELECT * FROM Report
LEFT JOIN ClothingObservation ON
(ClothingObservation.ReportId = Report.ReportId)
SELECT * FROM Report
LEFT JOIN HygieneObservation ON
(HygieneObservation.ReportId = Report.ReportId)

ReportId Period Reporter DepartmentId ClothingObservationId ... ReportId
----------- ---------- -------- ------------ --------------------- ...- -----------
1 2016-05-01 username 1 1 ... 1
1 2016-05-01 username 1 2 ... 1
1 2016-05-01 username 1 3 ... 1
1 2016-05-01 username 1 4 ... 1
1 2016-05-01 username 1 5 ... 1
1 2016-05-01 username 1 6 ... 1
1 2016-05-01 username 1 7 ... 1
1 2016-05-01 username 1 8 ... 1
1 2016-05-01 username 1 9 ... 1
1 2016-05-01 username 1 10 ... 1

ReportId Period Reporter DepartmentId HygieneObservationId ... ReportId
----------- ---------- -------- ------------ -------------------- ... -----------
1 2016-05-01 username 1 3 ... 1
1 2016-05-01 username 1 4 ... 1
1 2016-05-01 username 1 5 ... 1
1 2016-05-01 username 1 6 ... 1
1 2016-05-01 username 1 7 ... 1
1 2016-05-01 username 1 8 ... 1
1 2016-05-01 username 1 9 ... 1
1 2016-05-01 username 1 10 ... 1
1 2016-05-01 username 1 12 ... 1
1 2016-05-01 username 1 13 ... 1


My goal is to get this output (or something like it) with one query.

Answer

What is happening is that joining Report (1 row) to ClothingObservation (10 rows) produces 10 row (1 x 10), you then join to HygieneObservation (10 rows) which gives you 100. The reason this is happening is because after the initial join you have 10 rows with the same ReportID so the next join takes each of these 10 rows and joins to the 10 rows in HygieneObservation.

The solution for "20 rows with NULL values":

SELECT 
    Report.ReportId, 
    Report.Period, 
    Report.Reporter, 
    Report.DepartmentId, 
    ClothingObservation.ClothingObservationId,
     NULL AS HygieneObservationId
FROM Report
LEFT JOIN ClothingObservation ON
    (ClothingObservation.ReportId = Report.ReportId)
UNION ALL
SELECT 
    Report.ReportId, 
    Report.Period, 
    Report.Reporter, 
    Report.DepartmentId, 
    NULL AS ClothingObservationId,
    HygieneObservation.HygieneObservationId
FROM Report
LEFT JOIN HygieneObservation ON
    (HygieneObservation.ReportId = Report.ReportId)

How it works:
You essentially write two separate queries: one that join Report and ClothingObservation and another that joins Report to HygieneObservation. You then combine the two queries with UNION ALL.

The solution for "get 10 rows"

This is complex as it involves what I call "vertical merging" or "Merge Join". Below is the query (Update: I have tested it).

SELECT 
    Report.ReportId, 
    Report.Period, 
    Report.Reporter, 
    Report.DepartmentId, 

    MergedObservations.ClothingObservationId,
    MergedObservations.HygieneObservationId
FROM Report
    LEFT JOIN 
            ( SELECT COALESCE( ClothingObservation.ReportID, HygieneObservation.ReportID ) AS ReportID,
                    HygieneObservationID, ClothingObservationID -- Add appropriate columns
            FROM
                        ( SELECT ROW_NUMBER() OVER( PARTITION BY ReportID ORDER BY ClothingObservationID ) AS ResultID, ReportID, ClothingObservationID
                        FROM ClothingObservation ) AS ClothingObservation
                FULL OUTER JOIN
                        ( SELECT ROW_NUMBER() OVER( PARTITION BY ReportID ORDER BY HygieneObservationID  ) AS ResultID, ReportID, HygieneObservationID
                        FROM HygieneObservation ) AS HygieneObservation
                    ON ClothingObservation.ReportID = HygieneObservation.ReportID
                        AND ( ClothingObservation.ResultID = HygieneObservation.ResultID OR ClothingObservation.ResultID IS NULL OR HygieneObservation.ResultID IS NULL )
            ) AS MergedObservations
        ON Report.ReportID = MergedObservations.ReportID

How it works:
Because ClothingObservation and HygieneObservationId are not directly related to each other and have differing number of rows per ReportID, I use a ROW_NUMBER() function to generate a join key. I then do a "Merge Join" using ReportID and the output of the ROW_NUMBER() function.

Sample Data

I have converted your sample data into a usable table data to test above queries.

CREATE TABLE Report( ReportId INT, Period DATETIME, Reporter VARCHAR( 20 ), DepartmentId INT )
CREATE TABLE ClothingObservation( ClothingObservationID INT, ReportId INT )
CREATE TABLE HygieneObservation( HygieneObservationID INT, ReportId INT )

INSERT INTO Report
VALUES( 1, '2016-05-01', 'username', 1 )

INSERT INTO ClothingObservation
VALUES
( 1, 1 ), ( 2, 1 ), ( 3, 1 ), ( 4, 1 ), ( 5, 1 ), ( 6, 1 ), ( 7, 1 ), ( 8, 1 ), ( 9, 1 ), ( 10, 1 )

INSERT INTO HygieneObservation
VALUES
( 3, 1 ), ( 4, 1 ), ( 5, 1 ), ( 6, 1 ), ( 7, 1 ), ( 8, 1 ), ( 9, 1 ), ( 10, 1 ), ( 11, 1 ), ( 12, 1 ), ( 13, 1 )