Lory · 4 years ago
SQL Question

PYMSSQL/SQL Server 2014: is there a limit to the length of a list of PKs to use as a subquery?

I have written a Python script to divide millions of documents (generated by a .NET web application and all stored in a single directory) into subfolders following the scheme year/month/batch, since the tasks these documents come from were originally divided into batches.
The script queries SQL Server 2014, which contains all the data it needs for each document, in particular the month and year it was created in. It then uses the
shutil
module to move each PDF. First, I run a query to get the list of batches for a given month and year:

queryBatches = '''SELECT DISTINCT IDBATCH
FROM [DBNAME].[dbo].[WORKS]
WHERE YEAR(DATETIMEWORK)={} AND MONTH(DATETIMEWORK)={}'''.format(year, month)
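
As an aside, predicates like YEAR(DATETIMEWORK)={} AND MONTH(DATETIMEWORK)={} cannot use an index on DATETIMEWORK; on a large table a half-open date range can be much faster. This is only a suggested rewrite, not the original code; a minimal sketch of computing the bounds in Python:

```python
from datetime import date

def month_bounds(year, month):
    # Half-open interval [start, end): first day of the month up to
    # (but excluding) the first day of the next month.
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

start, end = month_bounds(2016, 12)
queryBatches = '''SELECT DISTINCT IDBATCH
FROM [DBNAME].[dbo].[WORKS]
WHERE DATETIMEWORK >= '{}' AND DATETIMEWORK < '{}' '''.format(
    start.isoformat(), end.isoformat())
```

A range filter like this is sargable, so an index on DATETIMEWORK (if one exists) can be used to narrow the scan.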


Then I perform:

for batch in batches:
    query = '''SELECT IDWORK, IDBATCH, NAMEDOCUMENT
               FROM [DBNAME].[dbo].[WORKS]
               WHERE NAMEDOCUMENT IS NOT NULL and
               NAMEDOCUMENT not like '/%/%/%/%.pdf' and
               YEAR(DATETIMEWORK)={} and
               MONTH(DATETIMEWORK)={} and
               IDBATCH={}'''.format(year, month, batch[0])


whose records are collected through a cursor, as described in the pymssql documentation. Then I go on with:

IDWorksUpdate = []
row = cursor.fetchone()
while row:
    if moveDocument(...):
        IDWorksUpdate.append(row[0])
    row = cursor.fetchone()


Finally, when the loop has ended,
IDWorksUpdate
holds the PKs of all the WORKS rows whose documents were successfully moved into a subfolder. So I close the cursor and the connection and open new ones.
In the end I perform:

subquery = '('+', '.join(str(x) for x in IDWorksUpdate)+')'
query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(year,month,idbatch,subquery)

newConn = pymssql.connect(server='localhost', database='DBNAME')
newCursor = newConn.cursor()

try:
    newCursor.execute(query)
    newConn.commit()
except:
    newConn.rollback()
    log.write('Error on updating documents names in database of works {}/{} of batch {}'.format(year, month, idbatch))
finally:
    newCursor.close()
    del newCursor
    newConn.close()
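
As a quick sanity check, this is the string the code above builds for a toy list of three PKs (the ID values here are made up for illustration):

```python
year, month, idbatch = 2016, 12, 1484
IDWorksUpdate = [101, 102, 103]

# Render the PK list as a literal IN list: "(101, 102, 103)"
subquery = '(' + ', '.join(str(x) for x in IDWorksUpdate) + ')'
query = ("UPDATE [DBNAME].[dbo].[WORKS] "
         "SET NAMEDOCUMENT = '/{}/{}/{}/'+NAMEDOCUMENT "
         "WHERE IDWORK IN {}").format(year, month, idbatch, subquery)
print(query)
```

With 55,000+ PKs the same code produces one enormous literal IN list, which is exactly what the error below complains about.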


This morning I found that, for a couple of batches, the update query failed at the database even though the documents had been correctly moved into subdirectories.
Those batches had more than 55,000 documents to move, so maybe
IDWorksUpdate
grew too large and produced a malformed final update query? I thought 55,000 was not such a big list of integers. The problem is that in pymssql we cannot have more than one connection/cursor at a time to the same database, so I cannot update each record as its file is moved. That is why I build a list of the PKs of the works whose documents were correctly moved and update them at the end with a new connection/cursor. What could have happened? Am I doing it wrong?

UPDATE

I've just written a simple script to reproduce the query that is executed to update the records, and this is the error I get from SQL Server:


The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.


This is the query:

UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE IDWORK IN (list of 55157 PKs)


The fact is that the table is very big (about 14 million records). But I need that list of PKs because only the tasks whose documents have been correctly processed and moved can be updated. I cannot simply run:

UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE YEAR(DATETIMEWORK)=2016 and
MONTH(DATETIMEWORK)=12 and IDBATCH=1484


This is because our server was hit by a cryptolocker, so I must process and move only the documents that still exist, while waiting for the others to be recovered.
Should I split that list into sublists? How?
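
To answer the "how" part: one common way to split a list into fixed-size sublists is a slicing loop. A minimal sketch (the chunk size is arbitrary here):

```python
def chunks(seq, size):
    # Yield successive sublists of at most `size` elements each.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

pks = list(range(25))
sublists = list(chunks(pks, 10))
# 25 items with size 10 -> three sublists of 10, 10 and 5 elements
```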

UPDATE 2

It seems the following could be a solution: I split the list of PKs into chunks of 10,000 (a purely experimental number) and then execute one query per chunk, each with one chunk as its subquery.

def updateDB(listID, y, m, b, log):

    newConn = pymssql.connect(server='localhost', database='DBNAME')
    newCursor = newConn.cursor()

    if len(listID) <= 10000:
        subquery = '('+', '.join(str(x) for x in listID)+')'
        query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y, m, b, subquery)

        try:
            newCursor.execute(query)
            newConn.commit()
        except:
            newConn.rollback()
            log.write('...')
            log.write('\n\n')
        finally:
            newCursor.close()
            del newCursor
            newConn.close()
    else:
        # Split the PK list into chunks of 10000 and run one UPDATE per chunk
        chunksPK = [listID[i:i + 10000] for i in xrange(0, len(listID), 10000)]

        for sublistPK in chunksPK:
            subquery = '('+', '.join(str(x) for x in sublistPK)+')'
            query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y, m, b, subquery)

            try:
                newCursor.execute(query)
                newConn.commit()
            except:
                newConn.rollback()
                log.write('Could not execute partial {}'.format(query))
                log.write('\n\n')

        newCursor.close()
        del newCursor
        newConn.close()


Could this be a good/secure solution?

Answer Source

As stated in the MSDN document

IN (Transact-SQL)

Explicitly including an extremely large number of values (many thousands of values separated by commas) within the parentheses, in an IN clause can consume resources and return errors 8623 or 8632. To work around this problem, store the items in the IN list in a table, and use a SELECT subquery within an IN clause.

(The error message you cited was error 8623.)

Putting the IN list values into a temporary table and then using

... WHERE IDWORK IN (SELECT keyValue FROM #inListTable)

strikes me as being more straightforward than the "chunking" method you described.
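
A rough sketch of that pattern is below, using the standard-library sqlite3 module as a stand-in for SQL Server so it can run anywhere; the table contents are made up for illustration. With pymssql you would instead CREATE TABLE #inListTable, fill it with cursor.executemany, and use + rather than || for string concatenation:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE WORKS (IDWORK INTEGER PRIMARY KEY, NAMEDOCUMENT TEXT)')
cur.executemany('INSERT INTO WORKS VALUES (?, ?)',
                [(1, 'a.pdf'), (2, 'b.pdf'), (3, 'c.pdf')])

# Load the PKs of the successfully moved documents into a temp table...
cur.execute('CREATE TEMP TABLE inListTable (keyValue INTEGER PRIMARY KEY)')
moved_pks = [1, 3]   # e.g. document 2 was not moved
cur.executemany('INSERT INTO inListTable VALUES (?)', [(pk,) for pk in moved_pks])

# ...and let the UPDATE reference it via a subquery instead of a literal IN list.
cur.execute("UPDATE WORKS SET NAMEDOCUMENT = '/2016/12/1484/' || NAMEDOCUMENT "
            "WHERE IDWORK IN (SELECT keyValue FROM inListTable)")
conn.commit()
```

Because the keys live in a table, the query text stays tiny no matter how many PKs there are, and the query processor never has to plan around tens of thousands of literals.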
