dan arters dan arters - 5 months ago 5
Python Question

Efficiency of checking for string in querystring parameters in python

I am running a query that needs to exclude all previously seen items and return a certain number of items (10 in this case). The list of items is sent in a querystring parameter and are uuid4 strings separated by '&'.

With a possible very large database, I don't think it makes sense to add the exclude statement with the query as most of the results wont be in the alreadySeenItems list and the database is pretty hefty.

Which of the following methods would be faster, assuming that the alreadySeenList can be pretty large >1000 items in some case.

Note: I would normally just use the first one, but since I need an exact match and I know where each word starts, it might make sense to do otherwise

def getItems(request, alreadySeenItems):
newItems = []
allPossibleItems = # unrelated heinous sql query

# method 1
for item in allPossibleItems:
if item not in alreadySeenItems:
newItems.append(item)
if len(newItems) > 10:
return newItems

# method 2
alreadySeenItemsList = alreadySeenItems.split('&')
for item in allPossibleItems:
if not checkForItemInList(item, alreadySeenItems)
newItems.append(item)
if len(newItems) > 10:
return newItems


def checkForItemInList(item, items):
for tmp in items:
if item == tmp:
return True
return False

Answer

Whatever you are doing will be much faster if you use Python's built in set data structure. Checking for inclusion then is very fast, if you have some memory available, but 1k items is nothing. see https://docs.python.org/2/library/stdtypes.html#set

Not to dodge the original question, but we also can't be sure if your database is truly large. Have you added indexes to your table? As far as most of the results being of one nature or another leading to some inefficiency, some systems (Postgresql at least) allow for "partial indexing" which allows you to build an index on just some values, which can make things a little leaner - but I'm not completely sure if this is relevant to you.

Comments