cd98 cd98 - 1 year ago 52
Python Question

NTILE for Sqlite from Pandas gives OPERATIONAL ERROR

I'm trying to use the NTILE function for querying a SQLite database from pandas, but I haven't succeeded, even though I've rechecked the syntax many times.

Self-contained example below.

import pandas as pd
from sqlalchemy import create_engine
disk_engine = create_engine('sqlite:///test.db')

marks = pd.DataFrame({'StudentID': ['S1', 'S2', 'S3', 'S4', 'S5'],
'Marks': [75, 83, 91, 83, 93]})
marks.to_sql('marks_sql', disk_engine, if_exists='replace')

Now try to use NTILE:

q = """select StudentID, Marks, NTILE(2) OVER (ORDER BY Marks DESC)
AS groupexample FROM marks_sql"""
pd.read_sql_query(q, disk_engine)

The traceback is long, but it's main parts are:

OperationalError: near "(": syntax error
OperationalError: (sqlite3.OperationalError) near "(": syntax error [SQL: 'select StudentID, Marks, NTILE(2) OVER (ORDER BY Marks DESC)\n AS groupexample FROM marks_sql']



There is no NTILE () OVER functionality in SQLITE

Gives me the same error, need to create this using more complex query or functions

Here is a list of unsupported analytical functions not available in SQLITE

NTILE is one of these

The optimizer goes inside the query first to find OVER, which it thinks is a column name and does not expect ( to follow a column name, so gives you this error.

To replicate NTILE try this:

select * ,
    (select count(*)+0.0 from marks_sql b where table.Marks >= b.Marks)
    /(select count(*) from marks_sql ) >0.5 
  then 1 
  else 2 end
from marks_sql;

In order to do this in such a way that the table can grow in size and this technique still applies we have do a few things:

So first we order the table by Marks (essentially create a ranking). This counts the rows with higher or equal Marks:

select count(*)+0.0 from marks_sql b where table.Marks >= b.Marks  --rank of Mark

We add 0.0 to make this number a float so our fraction works in the next step.

We then take the rank and divide it by the total row count

select count(*) from marks_sql -- row count

This gives us a distribution over the range of scores, the percentile for each Student. But we do not care about each exact percentile, we care about the NTILE(2) or whether they are in the top half.

That is where the CASE statement comes into play. If the percentile of the student is over 50% they fall in the #1 group, the top 50th percentile. Everyone else falls in the #2 group.