David Heyman David Heyman - 3 months ago 22
Python Question

SQLAlchemy emitting cross join for no reason

I had a query set up in SQLAlchemy which was running a bit slow, tried to optimize it. The result, for unknown reason, uses an implicit cross join, which is both significantly slower and comes up with entirely the wrong result. I’ve anonymized the table names and arguments but otherwise made no changes. Does anyone know where this is coming from?

To make it easier to find: The differences in new and old emitted SQL are that the new one has a longer SELECT and mentions all three tables in the WHERE before any JOINs.

Original code:

cust_name = u'Bob'
proj_name = u'job1'
item_color = u'blue'
query = (db.session.query(Item.name)
.join(Project, Customer)
.filter(Customer.name == cust_name,
Project.name == proj_name)
.distinct(Item.name))

# some conditionals determining last filter, resolving to this one:
query = query.filter(Item.color == item_color)

result = query.all()


Original emitted SQL as logged by flask_sqlalchemy.get_debug_queries:

QUERY: SELECT DISTINCT ON (items.name) items.name AS items_name
FROM items JOIN projects ON projects.id = items._project_id JOIN customers ON customers.id = projects._customer_id
WHERE customers.name = %(name_1)s AND projects.name = %(name_2)s AND items.color = %(color_1)s
Parameters: `{'name_2': u'job1', 'state_1': u'blue', 'name_1': u'Bob'}


New code:

cust_name = u'Bob'
proj_name = u'job1'
item_color = u'blue'
query = (db.session.query(Item)
.options(Load(Item).load_only('name', 'color'),
joinedload(Item.project, innerjoin=True).load_only('name').
joinedload(Project.customer, innerjoin=True).load_only('name'))
.filter(Customer.name == cust_name,
Project.name == proj_name)
.distinct(Item.name))

# some conditionals determining last filter, resolving to this one:
query = query.filter(Item.color == item_color)

result = query.all()


New emitted SQL as logged by flask_sqlalchemy.get_debug_queries:

QUERY: SELECT DISTINCT ON (items.nygc_id) items.id AS items_id, items.name AS items_name, items.color AS items_color, items._project_id AS items__project_id, customers_1.id AS customers_1_id, customers_1.name AS customers_1_name, projects_1.id AS projects_1_id, projects_1.name AS projects_1_name
FROM customers, projects, items JOIN projects AS projects_1 ON projects_1.id = items._project_id JOIN customers AS customers_1 ON customers_1.id = projects_1._customer_id
WHERE customers.name = %(name_1)s AND projects.name = %(name_2)s AND items.color = %(color_1)s
Parameters: `{'state_1': u'blue', 'name_2': u'job1', 'name_1': u'Bob'}


In case it matters, the underlying database is PostgreSQL.

The original intent of the query only needs
Item.name
. The optimization attempt is looking less likely to actually be helpful the longer I think about it, but I still want to know where that cross-join came from in case it happens again somewhere that adding
joinedload
,
load_only
, etc. would actually help.

Answer

This is because a joinedload is different from a join. The joinedloaded entities are effectively anonymous, and the later filters you applied refer to different instances of the same tables, so customers and projects gets joined in twice.

What you should do is to do a join as before, but use contains_eager to make your join look like joinedload.

query = (session.query(Item)
                .join(Item.project)
                .join(Project.customer)
                .options(Load(Item).load_only('name', 'color'),
                         Load(Item).contains_eager("project").load_only('name'),
                         Load(Item).contains_eager("project").contains_eager("customer").load_only('name'))
                .filter(Customer.name == cust_name,
                        Project.name == proj_name)
                .distinct(Item.name))

This gives you the query

SELECT DISTINCT ON (items.name) customers.id AS customers_id, customers.name AS customers_name, projects.id AS projects_id, projects.name AS projects_name, items.id AS items_id, items.name AS items_name, items.color AS items_color 
FROM items JOIN projects ON projects.id = items._project_id JOIN customers ON customers.id = projects._customer_id 
WHERE customers.name = %(name_1)s AND projects.name = %(name_2)s AND items.color = %(color_1)s