m_dunn m_dunn - 5 months ago 6
SQL Question

Left Join: How to only use one value from left table on join when there are duplicates?

I have been reading a lot of answers on here about SQL joins, but I can't seem to find anything that fixes this problem. I have two tables that I want to join. The first (j_weather) is weather data. It includes a datetime field (called 'valid') with at least one entry per hour, but often more than one. The second table (j_collisions) contains police record data for traffic crashes. It has a field for date and time (called 'date_time'), which is inexplicably formatted as a string (I did not create the table). However I have been able to work around this so far, using substring functions (this is apparent in my SQL query below). Both date/time fields are formatted 'yyyy-MM-dd HH:ss'

The results I want are the number of traffic collisions for each hour in the dataset:

date_hour | collisions
-----------------|------------
2005-12-01 00:00 | 0
2005-12-01 01:00 | 0
2005-12-01 02:00 | 1
2005-12-01 03:00 | 1
2005-12-01 04:00 | 0
2005-12-01 05:00 | 1


If there are no collisions for that hour, it should return a zero. I can't just do a select from the j_collisions table because there aren't crashes for every hour in the analysis period, and I want every hour to show up even if it only shows zero crashes.

Here is the query I have constructed:

select
format(w.valid, 'yyyy-MM-dd HH') + ':00' as date_hour,
count(c.master_file_number) as collisions
from
hollings.dbo.j_weather as w
left join
hollings.dbo.j_collisions as c
on
format(w.valid, 'yyyy-MM-dd HH') = substring(c.date_time, 1, 13)
group by
format(w.valid, 'yyyy-MM-dd HH')
order by
date_hour


However, because as I mentioned above, some hours have multiple records for weather readings. So if, for instance, 03:00 on 2005-12-01 had three weather readings, my results would (incorrectly) read as follows, instead of the correct desired results above. Note the result of 3 collisions in the 03:00 hour instead of the correct 1 collision.

date_hour | collisions
-----------------|------------
2005-12-01 00:00 | 0
2005-12-01 01:00 | 0
2005-12-01 02:00 | 1
2005-12-01 03:00 | 3
2005-12-01 04:00 | 0
2005-12-01 05:00 | 1


I am a beginner with SQL, but I have tried as many options as I could think of/discover online. I apologize in advance if I missed any details, this is my first time asking a question on here. Any help would be greatly appreciated!

Answer

try with count(distinct c.master_file_number)

select
    format(w.valid, 'yyyy-MM-dd HH') + ':00' as date_hour,
    count(distinct c.master_file_number) as collisions
from
    hollings.dbo.j_weather as w
left join
    hollings.dbo.j_collisions as c
on
    format(w.valid, 'yyyy-MM-dd HH') = substring(c.date_time, 1, 13)
group by
    format(w.valid, 'yyyy-MM-dd HH')
order by
    date_hour