juwiley - 10 months ago
Ruby Question

Significant overhead for using threads in Ruby for metric collection?

I have a server that collects and analyzes usage metrics. I want various pieces of my architecture to periodically send metrics to the server via a REST API.

I don't want to block execution while the metrics are being transmitted, so I've considered creating a method that spins off a thread for each transmission:

require 'net/http'

module Metrics
  def self.time(time_to_process)
    Thread.new do
      uri = URI.parse(url) # url: the metrics server endpoint, defined elsewhere
      http = Net::HTTP.new(uri.host, uri.port)
      # ...do a bunch of setup (including building request)...
      response = http.request(request)
    end
  end
end

...and inside the application.

def app_method
  # ...do stuff, measure time
  Metrics.time(time_to_process) # time_to_process: the time measured above
end

Since the application code is single-threaded, and it takes a second or two for app_method to execute, I don't anticipate having more than 10-100 metric collection threads in action at any one time, so OS thread limits are not a big concern.

However, I'm wondering what the overhead is, in terms of memory and CPU time, of spinning off the new thread (not counting the memory/CPU required to actually do the Net::HTTP call). Is there a significant downside to this approach?

Answer Source

The short answer is YES - firing up a new Thread ad hoc has memory and CPU overhead, which is very significant!
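The CPU side of that cost can be checked on your own setup rather than taken on faith. The following is only a rough sketch: thread_spawn_seconds is a hypothetical helper, the threads do no work, and memory is not measured at all. The numbers will vary with Ruby version and platform.

```ruby
# Hypothetical helper: time how long it takes to create and join `count`
# threads that do nothing, so only spawn/teardown cost is measured.
def thread_spawn_seconds(count)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  threads = Array.new(count) { Thread.new { } }
  threads.each(&:join)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

count = 1_000
elapsed = thread_spawn_seconds(count)
puts format('%d threads: %.4fs total, %.1f us per thread',
            count, elapsed, elapsed / count * 1_000_000)
```

Comparing the per-thread figure against the second or two that app_method takes gives a sense of how much the spawn cost matters for a given workload.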

The industry-standard way to avoid creating a new thread whenever you want to do a background job is to use a thread pool, which is simply a number of threads created in advance that wait to receive messages and do the work accordingly.
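A thread pool along those lines can be sketched with nothing but Ruby's built-in Queue. This is a minimal illustration, not a production implementation: MetricsPool, post, and shutdown are hypothetical names, and the job body is a stand-in for the Net::HTTP call from the question.

```ruby
# A fixed set of worker threads pulls jobs from a shared Queue.
class MetricsPool
  def initialize(size)
    @jobs = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        # Queue#pop blocks until a job arrives; a nil job tells the worker to stop.
        while (job = @jobs.pop)
          begin
            job.call
          rescue StandardError => e
            warn "metric job failed: #{e.message}"
          end
        end
      end
    end
  end

  # Enqueue a block of work; returns immediately without blocking the caller.
  def post(&job)
    @jobs << job
    self
  end

  # Send one stop signal per worker, then wait for all of them to finish.
  def shutdown
    @workers.size.times { @jobs << nil }
    @workers.each(&:join)
  end
end

pool = MetricsPool.new(4)
results = Queue.new
10.times { |i| pool.post { results << i * 2 } } # stand-in for the HTTP request
pool.shutdown
puts results.size
```

The threads are created once, so each call to post costs only a queue push. For real use, a maintained library is probably a better choice than a hand-rolled pool.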

Looking at similar solutions (like New Relic's), most use a background process (or agent) which is in charge of actually sending the information to the server, while the application sends lightweight messages to the agent, which aggregates them and bulk-sends them at its convenience.
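The "lightweight message, bulk send" idea can be illustrated inside a single process. The sketch below is an assumption-laden simplification: real agents typically run as a separate process and also flush on a timer, while here a single background thread batches by count only. MetricsBuffer, record, and the sender block are hypothetical names, and the sender block is where one bulk HTTP request would go.

```ruby
class MetricsBuffer
  def initialize(batch_size:, &sender)
    @batch_size = batch_size
    @sender = sender
    @queue = Queue.new
    @flusher = Thread.new { run }
  end

  # Cheap for the caller: just pushes a small hash onto the queue.
  def record(name, value)
    @queue << { name: name, value: value }
  end

  # Close the queue, let the flusher drain what is left, and wait for it.
  def stop
    @queue.close
    @flusher.join
  end

  private

  def run
    batch = []
    # Queue#pop returns nil once the queue is closed and empty.
    while (metric = @queue.pop)
      batch << metric
      flush(batch) if batch.size >= @batch_size
    end
    flush(batch) unless batch.empty?
  end

  def flush(batch)
    @sender.call(batch.dup) # one bulk send instead of one request per metric
    batch.clear
  end
end

sent_batches = []
buffer = MetricsBuffer.new(batch_size: 3) { |batch| sent_batches << batch }
7.times { |i| buffer.record('app_method.time', i) }
buffer.stop
puts sent_batches.map(&:size).inspect # → [3, 3, 1]
```

Only one extra thread exists no matter how many metrics are recorded, and the number of sends drops by roughly the batch size.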

In a Rails system, building a background job system from scratch is not recommended, and you should consider using gems like Sidekiq, along with its suggested architecture, to do this for you. Most of those don't depend on threads within the main application either, but on their own processes (sometimes on their own machines), communicating with the application through messages on a queue (using a data store like Redis, for example).