Sahil Gadimbayli - 11 months ago
Ruby Question

Working with large dataset imports in Ruby/Rails

I'm currently working on a Ruby/Rails project that imports invoices into the database, and I'm trying to maximise the efficiency of the process, which is far too slow right now.

For an import batch of 100,000 rows it takes around 2.5-3 hours to process and save each record to the database.

//// Ruby code

class DeleteImportStrategy
  def pre_process(merchant_prefix, channel_import)
    # The channel is needed to identify invoices, so an import from another
    # channel cannot collide with this one if they share the same merchant_prefix.
    Jzbackend::Invoice.where(merchant_prefix: merchant_prefix, channel: channel_import.channel).delete_all
    # Get rid of all previous import batches, which become empty after delete_import_strategy.
    Jzbackend::Import.where.not(id: channel_import.id).where(channel: channel_import.channel).destroy_all
  end

  def process_row(row, channel_import)
    debt_claim = Jzbackend::Invoice.new
    debt_claim.import = channel_import
    debt_claim.status = 'pending'
    debt_claim.channel = channel_import.channel
    debt_claim.merchant_prefix = row[0]
    debt_claim.debt_claim_number = row[1]
    debt_claim.amount = Monetize.parse(row[2])
    debt_claim.print_date = row[3]
    debt_claim.first_name = row.try(:[], 4)
    debt_claim.last_name = row.try(:[], 5)
    debt_claim.address = row.try(:[], 6)
    debt_claim.postal_code = row.try(:[], 7)
    debt_claim.city = row.try(:[], 8)
    debt_claim.save
  end
end

////

So for each import batch that comes in as CSV, I get rid of the previous batches and import the new one by reading each row and inserting it into the new Import as Invoice records. However, 2.5-3 hours for 100,000 entries seems excessive. How can I optimise this process? I'm sure it's not efficient this way.

Answer

First rule of mass imports: batch, batch, batch.

You're saving each row separately, which incurs huge overhead. Say the insert itself takes 1ms, but the roundtrip to the database takes 5ms. Total time per record: 6ms. For 1000 records that's 6000ms, or 6 seconds.

Now imagine that you use a mass insert, where you send data for multiple rows in the same statement. It looks like this:

INSERT INTO users (name, age)
VALUES ('Joe', 20), ('Moe', 22), ('Bob', 33), ...

Let's say you send data for 1000 rows in this one request. The request itself takes 1000ms (in reality it will likely be considerably quicker, since there's less overhead in parsing the query, preparing the execution plan, etc.). Total time taken is 1000ms + 5ms. At least a 6x reduction! (In real projects of mine, I've observed 100x-200x reductions.)
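In Rails you don't have to hand-write that SQL: the activerecord-import gem (or insert_all on Rails 6+) does mass inserts for you. Here's a rough sketch of how your process_row loop could be restructured to buffer rows and insert them in batches. The class name, BATCH_SIZE, and the csv_path parameter are illustrative, not taken from your code, and the column mapping just mirrors the question.

//// Ruby code

require 'csv'
# Gemfile: gem 'activerecord-import'

class BatchedImportStrategy
  BATCH_SIZE = 1000

  def process(csv_path, channel_import)
    buffer = []

    CSV.foreach(csv_path) do |row|
      buffer << build_invoice(row, channel_import)
      next if buffer.size < BATCH_SIZE

      # One INSERT carrying 1000 rows instead of 1000 separate INSERTs.
      # Note: activerecord-import skips ActiveRecord callbacks
      # (validations still run unless you pass validate: false).
      Jzbackend::Invoice.import(buffer)
      buffer = []
    end

    # Flush whatever is left over after the last full batch.
    Jzbackend::Invoice.import(buffer) if buffer.any?
  end

  private

  def build_invoice(row, channel_import)
    Jzbackend::Invoice.new(
      import: channel_import,
      status: 'pending',
      channel: channel_import.channel,
      merchant_prefix: row[0],
      debt_claim_number: row[1],
      amount: Monetize.parse(row[2]),
      print_date: row[3],
      first_name: row[4],
      last_name: row[5],
      address: row[6],
      postal_code: row[7],
      city: row[8]
    )
  end
end

////

Tune BATCH_SIZE to your database; a few hundred to a few thousand rows per statement is a common range.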
