edumike edumike - 2 months ago 20
Python Question

Python parse SQL and find relationships

I've got a large list of SQL queries all in strings, they've been written for Presto, so kinda formatted for MySQL.

I want to be able to tease out the table relationships written in some queries.

Let's start with something simple:

SELECT e.object_id, count(*)
FROM schema_name.elements AS e
JOIN schema_name2.quotes AS q ON q.id = e.object_id
WHERE e.object_type = 'something'
GROUP BY e.object_id, q.query
ORDER BY 2 desc;


Can clearly see where things join together, though there's aliases - so would need to scan through and find the aliases too - that's fine as the keyword "AS" is used.

So I'd want to have returned a list of relationships for the query, each relationship would look something like this dict:

dict = {'SourceSchema': 'schema_name',
'SourceTable': "elements",
'SourceColumn': "object_id",
'TargetSchema': "schema_name2",
'TargetTable': "quotes",
'TargetColumn': "id"}


I can imagine that doing that is pretty easy, but stuff gets more complicated:

SELECT e.object_id, count(*)
FROM schema_name.elements e
LEFT JOIN schema_name2.quotes q ON q.id = cast(coalesce(nullif(e.object_id,''),'0') as bigint)
WHERE e.object_type = 'something'
GROUP BY e.object_id, q.query
ORDER BY 2 desc;


3 things to note


  • Missing "AS" reserved word - could make it harder to get

  • When joining, there's a lot of stuff needed to parse the two tables together

  • This isn't a simple "JOIN" it's a left join



I'm wondering if there's some form of SQL Parsing library for Python that will allow me to tease out the relationships in some 4000 queries? And if not, then how could I do this efficiently? I'm guessing I might need to scan through queries, find the joins, find the alias, then look at how they're joined whilst taking into account a bunch of stop words that need to be discarded.

Answer

With some minor changes to the select_parser.py (https://sourceforge.net/p/pyparsing/code/HEAD/tree/trunk/src/examples/select_parser.py) that is part of the pyparsing examples, I get this after parsing your first example:

SELECT e.object_id, count(*) FROM schema_name.elements AS e        JOIN schema_name2.quotes AS q ON q.id = e.object_id WHERE e.object_type = 'something' GROUP BY e.object_id, q.query ORDER BY 2 desc;
['SELECT', [['e.object_id'], ['count', '*']], 'FROM', [['schema_name', '.', 'elements'], 'AS', 'e', ['JOIN'], ['schema_name2', '.', 'quotes'], 'AS', 'q', ['ON', ['q.id', '=', 'e.object_id']]], 'WHERE', ['e.object_type', '=', 'something'], 'GROUP', 'BY', [['e.object_id'], ['q.query']], 'ORDER', 'BY', [['2', 'DESC']], ';']
- columns: [['e.object_id'], ['count', '*']]
  [0]:
    ['e.object_id']
  [1]:
    ['count', '*']
- from: [[['schema_name', '.', 'elements'], 'AS', 'e', ['JOIN'], ['schema_name2', '.', 'quotes'], 'AS', 'q', ['ON', ['q.id', '=', 'e.object_id']]]]
  [0]:
    [['schema_name', '.', 'elements'], 'AS', 'e', ['JOIN'], ['schema_name2', '.', 'quotes'], 'AS', 'q', ['ON', ['q.id', '=', 'e.object_id']]]
    - table_alias: [['e'], ['q']]
      [0]:
        ['e']
      [1]:
        ['q']
- order_by_terms: [['2', 'DESC']]
  [0]:
    ['2', 'DESC']
    - direction: DESC
    - order_key: 2
- where_expr: ['e.object_type', '=', 'something']

So it looks like this example might help you get started. It was written to the SELECT format for SQLite, so you'll need to expand some of the syntax.