
Sidekiq + Mechanize: instance variable overwritten

I am building a simple web spider using Sidekiq and Mechanize.

When I run this for one domain, it works fine. When I run it for multiple domains, it fails. I believe the reason is that web_page gets overwritten when it is instantiated by another Sidekiq worker, but I am not sure if that's true or how to fix it.
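For illustration, here is a stripped-down, made-up example (not my real code; the Fetcher class and URLs are invented) of the kind of overwrite I suspect: two threads sharing one object clobber each other's instance variable.

# Hypothetical demo: one shared object, two threads writing the same ivar.
class Fetcher
  def fetch(url)
    @page = url          # single shared slot, like @web_page in my worker
    sleep 0.01           # give the other thread a chance to run and overwrite it
    puts "fetched #{url}, but @page is #{@page}"
  end
end

fetcher = Fetcher.new
threads = ['http://a.example', 'http://b.example'].map do |url|
  Thread.new { fetcher.fetch(url) }
end
threads.each(&:join)
# Often prints: fetched http://a.example, but @page is http://b.example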

# my scrape_search controller's create action searches on google.
def create
  @scrape = ScrapeSearch.build(keywords: params[:keywords], profession: params[:profession])
  agent = Mechanize.new
  scrape_search = agent.get('http://google.com/') do |page|
    search_result = page.form...
    search_result.css("h3.r").map do |link|
      result = link.at_css('a')['href'] # Narrowing down to real search results
      @domain = Domain.new(some params)
      ScrapeDomainWorker.perform_async(@domain.url, @domain.id, remaining_keywords)
    end
  end
end


I'm creating a Sidekiq job per domain. Most of the domains I'm looking for should contain just a few pages, so there's no need for sub-jobs per page.

This is my worker:

class ScrapeDomainWorker
  include Sidekiq::Worker
  ...

  def perform(domain_url, domain_id, keywords)
    @domain = Domain.find(domain_id)
    @domain_link = @domain.protocol + '://' + domain_url
    @keywords = keywords

    # First we scrape the homepage and get the first links
    @domain.to_parse = ['/'] # to_parse is an array of PATHS to parse for the domain
    mechanize_path('/')
    @domain.verified << '/' # verified is an Array field containing valid domain paths
    get_paths(@web_page) # Now we should have to_parse populated with homepage links

    @domain.scraped = 1 # Loop counter
    while @domain.scraped < 100
      @domain.to_parse.each do |path|
        @domain.to_parse.delete(path)
        @domain.scraped += 1
        mechanize_path(path) # We create a Nokogiri HTML doc with Mechanize for the valid path
        ...
        get_paths(@web_page) # Fire this to repopulate to_parse !!!
      end
    end
    @domain.save
  end

  def mechanize_path(path)
    agent = Mechanize.new
    begin
      @web_page = agent.get(@domain_link + path)
    rescue Exception => e
      puts "Mechanize Exception for #{path} :: #{e.message}"
    end
  end

  def get_paths(web_page)
    # This works when I scrape a single domain, but fails with ".gsub for nil"
    # when I scrape a few domains.
    paths = web_page.links.map { |link| link.href.gsub((@domain.protocol + '://' + @domain.url), "") }
    paths.uniq.each do |path|
      @domain.to_parse << path
    end
  end
end


This works when I scrape a single domain, but fails with ".gsub for nil" on web_page when I scrape a few domains.

Answer

That's correct! Sidekiq runs jobs concurrently, and your worker is overwriting its instance variables because the code is not thread-safe.

You can wrap your code in another class, and then create an object of that class within your worker:

class ScrapeDomainWrapper
  def initialize(domain_url, domain_id, keywords)
    # ...
  end

  def mechanize_path(path)
    # ...
  end

  def get_paths(web_page)
    # ...
  end
end

And your worker:

class ScrapeDomainWorker
  include Sidekiq::Worker

  def perform(domain_url, domain_id, keywords)
    ScrapeDomainWrapper.new(domain_url, domain_id, keywords)
  end
end
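Note that as written, ScrapeDomainWrapper.new only runs the wrapper's initialize. One natural variant (my sketch, not part of the original answer; the method name scrape is something I made up) is to give the wrapper a public method that does the work and call it from the worker, so that all mutable state lives on an object that exists only for the duration of one job:

class ScrapeDomainWrapper
  def initialize(domain_url, domain_id, keywords)
    @domain = Domain.find(domain_id)
    @domain_link = @domain.protocol + '://' + domain_url
    @keywords = keywords
  end

  # All state (@domain, @web_page, ...) now belongs to this per-job object,
  # so concurrent jobs cannot overwrite each other's variables.
  def scrape
    mechanize_path('/')
    get_paths(@web_page)
    # ... rest of the crawl loop from the original worker ...
  end
end

class ScrapeDomainWorker
  include Sidekiq::Worker

  def perform(domain_url, domain_id, keywords)
    ScrapeDomainWrapper.new(domain_url, domain_id, keywords).scrape
  end
end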