bkocik bkocik - 1 month ago 12
Ruby Question

Bizarre ActiveRecord issues - like generating invalid SQL

Recently we deployed a new version of our app, and since then we've been seeing some really weird issues with ActiveRecord. For example, here's a snippet of a query it generates hundreds of times per day, usually correctly:

`entries`.`style` AS t1_r25, `entries`.`pdf_visibility` AS , `entries`.`web_visibility` AS t1_r27


That's not a typo, t1_r26 is missing there although there's a space where it should be. But only that one time. That's not hand-written SQL, either, that's ActiveRecord writing the query and deciding on all the placeholder variables. It has similarly botched other queries leaving things blank that shouldn't be blank (shouldn't even be possible), but only once in a while. Most of the time it's fine.

We're also seeing a lot of instances where it complains about things like table_alias or reflection being an undefined variable or method on false:FalseClass. That's true...but the thing that is a FalseClass should have been an ActiveRecord model. We have no clue how any of this is happening, or how we could possibly have written a bug in our Rails code that would do most of this (especially the invalid query above).

We're on Rails 4.1.16 (we upgraded from 4.1.8 when this started happening) with Ruby 2.2.0 in Passenger 5.0.26 (going to 5.0.30 next). These errors are extremely sporadic and none of them make any sense. Out of thousands of requests per day, only a small handful of them (less than 10 across 5 servers) result in one of these weird errors, and we can't purposely reproduce any of them.

My entire team is stumped. We've spent hours poring over code changes and can't see anything that might cause any of this. We don't even know what we could possibly have written that would cause ActiveRecord to sometimes write a bad query in a way that we shouldn't be able to affect. We have no idea how to begin troubleshooting this kind of thing. Does anyone out there have a hint that might point us in some useful direction?

Update: Here's a new one it threw this morning. Note that LibraryItem is one of our pretty straightforward ActiveRecord models:

NoMethodError: undefined method `__callbacks' for #<LibraryItem:0x007f66cc5b82b0>


I...have no idea.

Answer

To close the loop for those who tried to help and for anyone who stumbles into this: We cured it by upgrading MRI. We'd been running on 2.2.0 for around a year, which was why we didn't immediately suspect it, and also because this started with a particular deployment. I was tipped off when we saw a couple of errors about an inability to allocate memory, and when MRI exploded in a hail of shrapnel on one server (by which I mean it segfaulted) and took Passenger down with it.

From there I started looking at MRI changelogs and noticed a ton of memory and GC related bug fixes between 2.2.0 and 2.2.5. Last night we upgraded to 2.2.5 with a deployment, and (fingers crossed) we haven't seen a single one of these weird issues yet. (Previously we were seeing 12-20 per day or more, depending on traffic).

So, why did it start happening following a deployment for us? I don't know for sure, but I have a guess: I'm thinking the size in bytes of our application in memory finally hit some critical mass at which it started triggering one or more of the MRI bugs that were fixed between 2.2.0 and 2.2.5. Best I can come up with.

Huge thanks to those who stepped in to try to assist!

Comments