Mike Girard Mike Girard - 10 months ago 86
Python Question

Python/Psycopg2/PostgreSQL Copy_From loop gets slower as it progresses

I have written a Python script that takes a 1.5 G XML file, parses out data and feeds it to a database using copy_from. It invokes the following function every 1000 parsed nodes. There are about 170k nodes in all which update about 300k rows or more. It starts out quite fast and then gets progressively slower as time goes on. Any ideas on why this is happening and what I can do to fix it?

Here is the function where I feed the data to the db.

def db_update(val_str, tbl, cols):

conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
cur = conn.cursor()
output = cStringIO.StringIO()
cur.copy_from(output, tbl, sep='\t', columns=(cols))

I haven't included the xml parsing as I don't think that's an issue. Without the db the parser executes in under 2 minutes.

Answer Source

There are several things that can slow inserts as tables grow:

  • Triggers that have to do more work as the DB grows
  • Indexes, which get more expensive to update as they grow

Disable any non-critical triggers, or if that isn't possible re-design them to run in constant time.

Drop indexes, then create them after the data has been loaded. If you need any indexes for the actual INSERTs or UPDATEs you'll need to keep them an wear the cost.

If you're doing lots of UPDATEs, consider VACUUMing the table periodically, or setting autovacuum to run very aggressively. That'll help Pg re-use space rather than more expensively allocating new space from the file system, and will help avoid table bloat.

You'll also save time by not re-connecting for each block of work. Maintain a connection.