pjw23 - 7 days ago
Ruby Question

Use Ruby Mechanize to scrape all successive pages

I'm looking for assistance on the best way to loop through successive pages on a website while scraping relevant data off of each page.

For example, I want to go to a specific site (craigslist in below example), scrape the data from the first page, go to the next page, scrape all relevant data, etc, until the very last page.

In my script I'm using a while loop since it seemed to make the most sense to me. However, it doesn't appear to be working properly and is only scraping data from the first page.

Can someone familiar with Ruby/Mechanize point me in the right direction on the best way to accomplish this task? I've spent countless hours trying to figure this out and feel like I'm missing something very basic.

Thanks in advance for your help.

require 'mechanize'
require 'pry'

# initialize
agent = Mechanize.new { |agent| agent.user_agent_alias = 'Mac Safari' }
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

# Create an empty array to dump contents into
property_results = []

# Scrape all successive pages from craigslist
while page.link_with(:dom_class => "button next") != nil
  next_link = page.link_with(:dom_class => "button next")
  page.css('ul.rows').map do |d|
    property_hash = { title: d.at_css('a.result-title.hdrlnk').text }
    property_results.push(property_hash)
  end
  page = next_link.click
end





UPDATE:
I found this, but still no dice:
Ruby Mechanize: Follow a Link




@pguardiario, thank you so much for your help. I'm definitely getting closer. The only problem now is that instead of the results being a single array of hash values, it contains multiple arrays with a lot of ~ ~ ~ values. Any ideas how to fix this? When I run a more advanced version of this, it gives the following error:

automate.rb:45:in `block in <main>': undefined method `text' for nil:NilClass (NoMethodError)
from /home/pjw/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from /home/pjw/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:186:in `upto'
from /home/pjw/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:186:in `each'
from automate.rb:43:in `map'
from automate.rb:43:in `<main>'


Here's my updated script based on your input.

require 'mechanize'
require 'httparty'
require 'pry'

# initialize
agent = Mechanize.new { |agent| agent.user_agent_alias = 'Mac Safari' }
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

# Create an empty array
property_results = []

# Scrape all successive pages from craigslist
while link = page.at('[rel=next]')
  page.css('ul.rows').map do |d|
    property_hash = { title: d.at_css('a.result-title.hdrlnk').text }
    property_results.push(property_hash)
  end
  link = page.at('[rel=next]')
  page = agent.get link[:href]
end
pry(binding)

Answer

Whenever you see a [rel=next], that's the thing you want to follow:

page = agent.get url
do_something_with page
while link = page.at('[rel=next]')
  page = agent.get link[:href]
  do_something_with page
end
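
To connect the loop above with the follow-up in the question, here is a minimal sketch of what do_something_with might look like. It is not a tested scraper for the live site. The helper names extract_titles and scrape_all are hypothetical, and the 'ul.rows li' row selector is an assumption about the page markup (the question only shows 'ul.rows' and 'a.result-title.hdrlnk'). The idea is to iterate over individual rows rather than the whole list, skip any row that has no title link so that text is never called on nil, and append every hash to one flat array.

```ruby
# Sketch only: the selectors are taken from the question or assumed,
# and may not match the live site.

# Hypothetical helper: returns one hash per listing row on a page.
# Rows without a title link are skipped, which avoids
# "undefined method `text' for nil:NilClass".
def extract_titles(page)
  page.css('ul.rows li').each_with_object([]) do |row, titles|
    link = row.at_css('a.result-title.hdrlnk')
    next if link.nil? # no title link in this row
    titles << { title: link.text.strip }
  end
end

# Hypothetical helper: scrapes the first page, then follows [rel=next]
# until there is no such link, so the last page is scraped as well.
def scrape_all(agent, url)
  results = []
  page = agent.get(url)
  loop do
    results.concat(extract_titles(page)) # keeps one flat array
    link = page.at('[rel=next]')
    # Assumption: a missing or empty href means there is no next page.
    break if link.nil? || link[:href].to_s.empty?
    page = agent.get(link[:href])
  end
  results
end

# Usage (needs the mechanize gem and network access):
#   require 'mechanize'
#   agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari' }
#   property_results = scrape_all(agent, 'http://charlotte.craigslist.org/search/rea')
```

Scraping before checking for the next link is what keeps the final page from being dropped, and concat (rather than pushing the result of map) is what keeps property_results a single array of hashes.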