Jeff Widman - 7 months ago
Python Question

Is it faster to batch my inserts when using SQLAlchemy bulk_insert_mappings()?

I've got 500K rows I want to insert into PostgreSQL using SQLAlchemy.

For speed, I'm inserting them using session.bulk_insert_mappings().

Normally, I'd break up the insert into smaller batches to minimize session bookkeeping. However, bulk_insert_mappings() uses dicts and bypasses a lot of the traditional session bookkeeping.

Will I still see a speed improvement if I break the insert up into smaller discrete batches, say doing an insert every 10K rows?

If so, should I close the PG transaction after every 10K rows, or leave it open the whole time?

Answer

In my experience, you'll see substantial performance improvements by issuing a single multi-row statement, INSERT INTO tbl VALUES (...), (...), ...;, as opposed to bulk_insert_mappings(), which uses executemany under the hood. In that case you'll want to batch the rows at least at the statement level for sanity.
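As a minimal sketch of what statement-level batching can look like with SQLAlchemy Core: the table name, columns, connection string, and batch size below are illustrative assumptions, not taken from the question. The key point is that insert().values(list_of_dicts) renders one multi-row VALUES statement per batch instead of an executemany of single-row inserts.

    from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

    # Hypothetical engine and table; substitute your own.
    engine = create_engine("postgresql://user:pass@localhost/mydb")
    metadata = MetaData()
    items = Table(
        "items", metadata,
        Column("id", Integer, primary_key=True),
        Column("name", String),
    )

    def chunked(rows, size):
        # Yield successive slices of `size` rows.
        for i in range(0, len(rows), size):
            yield rows[i:i + size]

    rows = [{"id": i, "name": "row %d" % i} for i in range(500_000)]

    with engine.begin() as conn:  # one transaction for the whole load
        for batch in chunked(rows, 10_000):
            # .values(list_of_dicts) compiles to
            # INSERT INTO items (id, name) VALUES (...), (...), ...
            conn.execute(items.insert().values(batch))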

Committing between batches probably won't have much of an effect on performance; the reason to do it would be to avoid keeping a transaction open for too long, which could impact other transactions running on the server.
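If a long-running transaction is a concern, a per-batch commit is a small variant of the loop above (again just a sketch, reusing the hypothetical names from the previous snippet):

    for batch in chunked(rows, 10_000):
        with engine.begin() as conn:  # each batch gets its own transaction
            conn.execute(items.insert().values(batch))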

You can also experiment with using COPY to load the data into a temporary table, then INSERTing from that table into the target.
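A rough sketch of that approach, assuming psycopg2 as the DBAPI driver and reusing the hypothetical engine, rows, and items table from above; the staging-table name and column layout are made up for illustration.

    import csv
    import io

    raw = engine.raw_connection()  # drop down to the psycopg2 connection
    try:
        with raw.cursor() as cur:
            # Staging table with the same shape as the target table.
            cur.execute("CREATE TEMP TABLE items_staging (LIKE items INCLUDING DEFAULTS)")

            # Stream the rows through COPY, typically the fastest bulk-load path.
            buf = io.StringIO()
            writer = csv.writer(buf)
            for row in rows:
                writer.writerow([row["id"], row["name"]])
            buf.seek(0)
            cur.copy_expert(
                "COPY items_staging (id, name) FROM STDIN WITH (FORMAT csv)", buf
            )

            # Move the staged rows into the real table in one statement.
            cur.execute("INSERT INTO items SELECT * FROM items_staging")
        raw.commit()
    finally:
        raw.close()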
