chingi chingi - 12 days ago
Python Question

What is the best way to fetch a huge amount of data from MySQL with SQLAlchemy?

I want to process over 10 million rows stored in MySQL, so I wrote this to slice the SQL into several parts and then concatenate the data for later processing. It works well while count < 2 million. However, as count rises, the time SQLAlchemy consumes grows much longer.

    from sqlalchemy import create_engine

    def fetch_from_sql(_sql_pat, count):
        """
        :param _sql_pat: SELECT id, data FROM a.b LIMIT {},{};
        :param count: how many rows you want to fetch from MySQL
        :return: generator
        """
        def gen_connect(sql):
            # open a connection per slice and stream its rows
            # (db_config is defined elsewhere in my project)
            __engine = create_engine(db_config['SQLALCHEMY_DATABASE_URI'])
            with __engine.connect() as c:
                for row in c.execute(sql):
                    yield row

        def gen_range(limit, step):
            # yield (offset, size) pairs covering the first `limit` rows
            if step > limit:
                yield 0, limit
            else:
                R = range(0, limit + 1, step)
                for idx, v in enumerate(R):
                    if idx == 0:
                        yield v, step
                    elif limit - v >= step:
                        yield v + 1, step
                    else:
                        yield v + 1, limit - v

        sqls = [_sql_pat.format(start, step) for start, step in gen_range(count, 100000)]
        sources = (gen_connect(sql) for sql in sqls)
        for s in sources:
            for item in s:
                yield item

The question is: why does SQLAlchemy take more and more time (I logged the times and posted them below), and what is the best way to deal with this situation?

Dumped 10000 items, at 2016-10-08 11:55:33
Dumped 1000000 items, at 2016-10-08 11:59:23
Dumped 2000000 items, at 2016-10-08 12:05:07
Dumped 3000000 items, at 2016-10-08 13:54:05


This is because you're using LIMIT/OFFSET: when you specify an offset of 3,000,000, for example, the database still has to read and discard the first 3,000,000 rows before it can return anything, and it has to repeat that work for every slice.

The correct way to do this is to ORDER BY some indexed column, like the primary key id column, and paginate with WHERE id > :last_fetched_id (keyset or "seek" pagination), so each query starts right where the previous one stopped.
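A minimal sketch of that approach, written against the asker's setup; the table and columns (a.b, id, data), the batch size, and db_config are assumptions carried over from the question, and id is assumed to be an indexed integer primary key:

    from sqlalchemy import create_engine, text

    def fetch_by_keyset(batch_size=100000):
        # Keyset ("seek") pagination: each batch starts after the last id
        # already seen, so MySQL never re-scans rows it has already returned.
        engine = create_engine(db_config['SQLALCHEMY_DATABASE_URI'])
        query = text(
            "SELECT id, data FROM a.b "
            "WHERE id > :last_id ORDER BY id LIMIT :batch"
        )
        last_id = 0
        with engine.connect() as conn:
            while True:
                rows = conn.execute(
                    query, {"last_id": last_id, "batch": batch_size}
                ).fetchall()
                if not rows:
                    break
                for row in rows:
                    yield row
                last_id = rows[-1][0]  # resume after the last id in this batch

Because each query only touches the rows it actually returns, the time per batch stays roughly constant instead of growing with the offset.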