I have written a Python script that takes a 1.5 GB XML file, parses out data and feeds it to a database using copy_from. It invokes the following function every 1000 parsed nodes. There are about 170k nodes in all, which update about 300k rows or more. It starts out quite fast and then gets progressively slower as time goes on. Any ideas on why this is happening and what I can do to fix it?
Here is the function where I feed the data to the db.
import cStringIO
import psycopg2

def db_update(val_str, tbl, cols):
    conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
    cur = conn.cursor()
    output = cStringIO.StringIO(val_str)
    cur.copy_from(output, tbl, sep='\t', columns=cols)
    conn.commit()
    conn.close()
There are several things that can slow inserts as tables grow:
Disable any non-critical triggers, or, if that isn't possible, re-design them to run in constant time.
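As a rough sketch (not from the question; mytable is a placeholder table name, and this assumes you own the table), user-defined triggers can be switched off before the load and back on afterwards:

import psycopg2

conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
cur = conn.cursor()
# Disable all user-defined triggers on the target table before loading.
cur.execute("ALTER TABLE mytable DISABLE TRIGGER USER")
conn.commit()

# ... run the copy_from() batches here ...

# Re-enable the triggers once the load is finished.
cur.execute("ALTER TABLE mytable ENABLE TRIGGER USER")
conn.commit()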
Drop indexes, then create them after the data has been loaded. If you need any indexes for the actual UPDATEs you'll need to keep them and wear the cost.
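For example, here is a sketch with placeholder names (mytable_col_idx and mytable (col) are not from the question): an index that isn't needed during the load is dropped up front and rebuilt once at the end.

import psycopg2

conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
cur = conn.cursor()
# Drop the index so each batch doesn't pay the cost of maintaining it.
cur.execute("DROP INDEX IF EXISTS mytable_col_idx")
conn.commit()

# ... run the copy_from() batches here ...

# Rebuild the index once, after all the data is in place.
cur.execute("CREATE INDEX mytable_col_idx ON mytable (col)")
conn.commit()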
If you're doing lots of UPDATEs, consider VACUUMing the table periodically, or setting autovacuum to run very aggressively. That'll help Pg re-use space rather than more expensively allocating new space from the file system, and will help avoid table bloat.
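A minimal sketch of issuing a manual VACUUM from the loader every so often (mytable is a placeholder; VACUUM can't run inside a transaction block, so the connection is put into autocommit mode first):

import psycopg2

conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
conn.autocommit = True   # VACUUM cannot run inside a transaction block
cur = conn.cursor()
# Reclaim dead row versions and refresh planner statistics for the table.
cur.execute("VACUUM ANALYZE mytable")
conn.autocommit = False

Alternatively, autovacuum can be tuned per table (for example by lowering autovacuum_vacuum_scale_factor) rather than vacuuming by hand.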
You'll also save time by not re-connecting for each block of work. Maintain a connection.
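One way to do that, sketched here under the assumption that the caller opens the connection once at startup and passes it in, is to move the connect() call out of db_update:

import cStringIO
import psycopg2

def db_update(conn, val_str, tbl, cols):
    # conn is opened once by the caller and reused for every batch.
    cur = conn.cursor()
    output = cStringIO.StringIO(val_str)
    cur.copy_from(output, tbl, sep='\t', columns=cols)
    conn.commit()
    cur.close()

# Open the connection once, then call db_update() for every block of parsed nodes.
conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")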