In Apache Hive I have to tables I would like to left-join keeping all the data from the left data and adding data where possible from the right table.
For this I use two joins, because the join is based on two fields (a material_id and a location_id).
This works fine with two traditional left joins:
INNER JOIN (some more complex select) b
ON a.material_id=b.material_id AND a.location_id=
CASE WHEN a.location_id = b.location_id THEN b.location_id ELSE ...;
| material_id | location_id | other_column_a |
| 100 | 1 | 45 |
| 101 | 1 | 45 |
| 103 | 1 | 45 |
| 103 | 2 | 45 |
| material_id | location_id | other_column_b |
| 100 | 1 | 66 |
| 102 | 1 | 76 |
| 103 | 2 | 88 |
Left - Join Table
| material_id | location_id | other_column_a | other_column_b
| 100 | 1 | 45 | 66
| 101 | 1 | 45 | NULL (mat. not in b)
| 103 | 1 | 45 | DEFAULT TO where location_id=2 (88)
| 103 | 2 | 45 | 88
The solution is to left join without
a.location_id = b.location_id and number all rows in order of preference. Then filter by row_number. In the code below the join will duplicate rows first because all matching material_id will be joined, then
row_number() function will assign 1 to rows where
a.location_id = b.location_id and 2 to rows where
a.location_id <> b.location_id if exist also rows where
a.location_id = b.location_id and 1 if there are not exist such.
b.location_id added to the
order by in the row_number() function so it will "prefer" rows with lower
b.location_id in case there are no exact matching. I hope you have caught the idea.
select * from ( SELECT a.*, b.*, row_number() over(partition by material_id order by CASE WHEN a.location_id = b.location_id THEN 1 ELSE 2 END, b.location_id ) as rn FROM a LEFT JOIN (some more complex select) b ON a.material_id=b.material_id )s where rn=1 ;