Homan Homan - 2 months ago 9
SQL Question

How to write this query to avoid cartesian product?

I want to create a CSV export for orders showing the warehouse_id where each order_item had shipped from, if available.

For brevity, here is the pertinent schema:

create table o (id integer);


orders have many order_items:

create table oi (id integer, o_id integer, sku text, quantity integer);


For each order_item in the CSV we want to show a warehouse_id from where it shipped out of. But that is not stored in order_items. It is stored in the shipment.

An order can be split up into many shipments from potentially from different warehouses.

create table s (id integer, o_id integer, warehouse_id integer);


shipments have many shipment items too:

create table si (id integer, s_id integer, oi_id integer, quantity_shipped integer);


How do I extract the warehouse_id for each order_item, given that warehouse_id is on the shipment and not every order has shipped yet (may not have a shipment record or shipment_items).

We are doing something like this (simplified):

select oi.sku, s.warehouse_id from oi
left join s on s.o_id = oi.o_id;


However if an order has 2 order items, let's call them sku A and B. And that order was split into two shipments where A was shipped from warehouse '50' and then a second shipment shipped B from '200'.

What we want would be a CSV output like:

sku | warehouse_id
-----|--------------
A | 50
B | 200


But what we get is some kind of cartesian product:

=================================

Here is the sample data:

select * from o;
id
----
1
(1 row)

select * from oi;
id | o_id | sku | quantity
----+------+-----+----------
1 | 1 | A | 1
2 | 1 | B | 1
(2 rows)

select * from s;
id | o_id | warehouse_id
----+------+--------------
1 | 1 | 50
2 | 1 | 200
(2 rows)

select * from si;
id | s_id | oi_id
----+------+------
1 | 1 | 1
2 | 2 | 2
(2 rows)

select oi.sku, s.warehouse_id from oi left join s on s.o_id = oi.o_id;
sku | warehouse_id
-----+--------------
A | 50
A | 200
B | 50
B | 200
(4 rows)


UPDATE ========

Per spencer, I'm adding a different example with different pk ids for more clarity. The following is 2 example orders. Order 2 has items A,B,C. A,B are shipped from shipment 200, C is shipped from shipment 201. Order 3 has 2 items E and A. E is not yet shipped and A is shipped twice out of the same warehouse '700', (like it was on back order).

# select * from o;
id
----
2
3
(2 rows)

# select * from oi;
id | o_id | sku | quantity
-----+------+-----+----------
100 | 2 | A | 1
101 | 2 | B | 1
102 | 2 | C | 1
103 | 3 | E | 1
104 | 3 | A | 2
(5 rows)

# select * from s;
id | o_id | warehouse_id
-----+------+--------------
200 | 2 | 700
201 | 2 | 800
202 | 3 | 700
203 | 3 | 700
(4 rows)

# select * from si;
id | s_id | oi_id
-----+------+-------
300 | 200 | 100
301 | 200 | 101
302 | 201 | 102
303 | 202 | 104
304 | 203 | 104
(5 rows)


I think this works, I use left join to keep the order_items in the report no matter if the order has shipped or not, I use group by to squash multiple shipments from the same warehouse. I believe this is what I need.

# select oi.o_id, oi.id, oi.sku, s.warehouse_id from oi left join si on si.oi_id = oi.id left join s on s.id = si.s_id group by oi.o_id, oi.id, oi.sku, s.warehouse_id order by oi.o_id;
o_id | id | sku | warehouse_id
------+-----+-----+--------------
2 | 102 | C | 800
2 | 101 | B | 700
2 | 100 | A | 700
3 | 104 | A | 700
3 | 103 | E |
(5 rows)

Answer

Order items that have shipped ...

SELECT oi.id
     , oi.sku
     , s.warehouse_id
  FROM oi
  JOIN si ON si.oi_id = oi.id
  JOIN s  ON s.id     = si.s_id

Order items that haven't yet shipped, using anti-join to exclude rows where there is a matching row in si

SELECT oi.id
     , oi.sku
     , s.warehouse_id
  FROM oi
  JOIN s ON s.o_id = oi.o_id      -- fk to fk shortcut join
    -- anti-join
  LEFT
  JOIN si ON si.oi_id = oi.id
 WHERE si.oi_id IS NULL

But this will still produce a (partial) Cartesian product. We can add a GROUP BY clause to collapse the rows...

 GROUP BY si.oi_id

This doesn't avoid producing an intermediate cartesian product; the addition of the GROUP BY clause collapses the set. But it's indeterminate which of matching rows from s column values will be returned from.

The two queries could be combined with a UNION ALL operation. If I did that, I'd likely add a discriminator column (an additional column in each query with different values, which would tell which query returned a row.)

This set might meet the specification outlined in the OP question. But I don't think this is really the set that needs to be returned. Figuring out which warehouse an item should ship from may involve several factors... total quantity ordered, quantity available in each warehouse, can order be fulfilled from one warehouse, which warehouse is closer to delivery destination, etc.

I don't want to leave anyone with the impression that this query is really a "fix" for the cartesian product problem... this query just hides a bigger problem.

Comments