
How do I optimize an ActiveRecord find_in_batches query?

I'm using Rails 4.0.0 and Ruby 2.0.0. My Post (as in blog posts) model is associated with a user through a combination of the user's user_name, first_name, and last_name. I'd like to migrate the data so that posts are associated with users by a foreign key, the user's id.

I have about 11 million records in the posts table.

I'm running the code below to migrate the data, via a rake task on a Linux server. However, the task keeps getting "Killed", presumably because the code below consumes too much memory.

I've found that lowering the batch_size to 20 and increasing sleep(10) to sleep(60) lets the task run longer and update more records in total before being Killed, but it takes significantly more time.

How can I optimize this code for speed and memory usage?

Post.where(user_id: nil).find_in_batches(batch_size: 1000) do |posts|
  puts "*** Updating batch beginning with post #{posts.first.id}..."
  sleep(10) # Hopefully, saving some memory usage.
  posts.each do |post|
    begin
      user = User.find_by(user_name: post.user_name, first_name: post.first_name, last_name: post.last_name)
      post.update(user_id: user.id)
    rescue NoMethodError => error # user could be nil, so user.id will raise a NoMethodError
      puts "No user found."
    end
  end
  puts "*** Finished batch."
end

Answer

Do all the work in the database, which is WAY faster than moving data back and forth.

This can be accomplished with ActiveRecord, joining on the same three columns the per-row find_by used. Of course, PLEASE test this before you unleash it on important data.

Post
  .where(user_id: nil)
  .joins("INNER JOIN users ON posts.user_name = users.user_name " \
         "AND posts.first_name = users.first_name " \
         "AND posts.last_name = users.last_name")
  .update_all("posts.user_id = users.id")