Manonthemoon - 9 days ago
Ruby Question

Refactoring Ruby scrape code (with different attributes)

I'm scraping ~10 websites for the same information, and currently have a script for each one of them that works on its own. These scripts all have the same base (iterate over available pages, scrape information, save it), but different attributes.

As an example, this is how I'm extracting the author element from two pages:

page.at('b[itemprop="author"]').children.text.strip
page.at('.author-username').text.strip


My goal is to refactor this so the main logic is handled in a class, but I'm having trouble figuring out how to pass in the above extractors depending on the source. I'm aware that I can pass CSS selectors as arguments, but as you can see, each extraction involves some additional logic.

While I could have a separate method to handle this (as outlined in the previous link), this would quickly get out of hand with ~10 sources.

What is the best way to refactor this code?

Answer

I would probably go with a hash.

Assuming there's not too much detail, put it all in a sort of Rosetta Stone hash that supplies the relevant info for each page. This can be used in conjunction with a case...when statement to load the relevant details.

Something like:

site_attributes = {
  site_1: ['attribute_1', 'attribute_2', ... ],
  site_2: ['attribute_3', 'attribute_4', ... ],
  ...
}
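As a rough sketch of how the hash and a case...when statement could work together: the host names, site keys, and method names below are hypothetical placeholders, and only the two selectors come from the question.

```ruby
require 'uri'

# Hypothetical site keys; the selectors are the two shown in the question.
SITE_SELECTORS = {
  site_1: ['b[itemprop="author"]'],
  site_2: ['.author-username']
}.freeze

# Maps a URL to a site key with case...when. The hosts are made-up examples.
def site_key_for(url)
  case URI.parse(url).host
  when 'site-one.example' then :site_1
  when 'site-two.example' then :site_2
  else raise ArgumentError, "unknown source: #{url}"
  end
end

# Returns the list of selectors configured for the site behind the URL.
def selectors_for(url)
  SITE_SELECTORS.fetch(site_key_for(url))
end
```

The shared scraping class would then only need the URL to find out which selectors to use.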

It may need to be a little more complex if you need to call different methods on the results of different attributes. In that case, the array for each site would need to contain hashes instead of strings. Something like:

[
  {
    attr: 'attribute_1',
    methods: ['children', 'text', 'strip']
  }, {
    attr: 'attribute_2',
    methods: ['text', 'strip']
  },
  ...
]

Then you can iterate over the attributes with each, pass each selector to page.at(), and call the additional methods on the result one after another.
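A minimal sketch of that loop, assuming page is a parsed document that responds to at (as in the question). The site keys, the :author field name, and the helper method names are hypothetical; symbols are used for the method names here, though strings would work with public_send as well.

```ruby
# Hypothetical per-site configuration. Only the selectors and the method
# chains are taken from the question; the keys are illustrative.
SITE_ATTRIBUTES = {
  site_1: {
    author: { attr: 'b[itemprop="author"]', methods: %i[children text strip] }
  },
  site_2: {
    author: { attr: '.author-username', methods: %i[text strip] }
  }
}.freeze

# Looks up the selector with page.at, then calls each listed method in order
# on the running result. Returns nil if the selector matches nothing.
def extract(page, spec)
  node = page.at(spec[:attr])
  return nil if node.nil?

  spec[:methods].reduce(node) { |result, name| result.public_send(name) }
end

# Extracts every configured field for one site into a hash of field => value.
def scrape_fields(page, site)
  SITE_ATTRIBUTES.fetch(site).transform_values { |spec| extract(page, spec) }
end
```

With this shape, adding a source means adding an entry to the hash; the extraction code itself stays the same. public_send is used instead of send so that only public methods can be called from the configuration.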
