krazynhazy krazynhazy - 3 months ago 14
Python Question

How do I multithread SQL Queries in python such that I obtain the results of all of the queries

Is there a way to use threads to simultaneously perform the SQL queries so I can cut down on processing time of my code below? Is there a better method to perform the same result as below without using the pandas module? Given the size of the data sets I am working with I cannot store the entire dataset in memory and I have found looping over the rows of a SELECT * FROM statement and comparing them against the list I am querying with adds to the processing time.

# DATABASE layout
# _____________________________________________________________
# | id | name | description |
# |_____________|____________________|__________________________|
# | 1 | John | Credit Analyst |
# | 2 | Jane | Doctor |
# | ... | ... | ... |
# | 5000000 | Mohammed | Dentist |
# |_____________|____________________|__________________________|

import sqlite3


SEARCH_IDS = [x for x in range(15000)]
DATABASE_NAME = 'db.db'

def chunks(wholeList, chunkSize=999):
"""Yield successive n-sized chunks from wholeList."""
for i in range(0, len(wholeList), chunkSize):
yield wholeList[i:i + chunkSize]

def search_database_for_matches(listOfIdsToMatch):
'''Takes a list of ids and returns the rows'''
conn = sqlite3.connect(DATABASE_NAME)
cursor = conn.cursor()
sql = "SELECT id, name, description FROM datatable WHERE id IN ({})".format(', '.join(["?" for x in listOfIdsToMatch]))
cursor.execute(sql,tuple(listOfIdsToMatch))
rows = cursor.fetchall()
return rows

def arrange(orderOnList,listToBeOrdered,defaultReturnValue='N/A'):
'''Takes a list of ids in the desired order and list of tuples which have ids as the first items.
the list of tuples is aranged into a new list corresponding to the order of the source list'''
from collections import OrderedDict
resultList=[defaultReturnValue for x in orderOnList]
indexLookUp = OrderedDict( [ ( value , key ) for key , value in enumerate( orderOnList ) ] )
for item in listToBeOrdered:
resultList[indexLookUp[item[0]]]=item
return resultList


def main():
results=[]
for chunk in chunks(SEARCH_IDS,999):
results += search_database_for_matches(chunk)
results = arrange(SEARCH_IDS,results)
print(results)


if __name__ == '__main__': main()

Answer

Some advices:

Instead of reading the records by chucks using a iterator, you ought to use pagination.

See this questions:

If you're using multithreading / multiprocessing make sure your database can support it. See: SQLite And Multiple Threads

To implement what you want you can use a pool of workers which work on each chunk. See Using a pool of workers in the Python documentation.

Example:

Import multiprocessing 

with multiprocessing.pool.Pool(process = 4) as pool:
    result = pool.map(search_database_for_match, [for chunk in chunks(SEARCH_IDS,999)])