Rubyx - 1 month ago
Ruby Question

How to avoid an interval with Mechanize

I'm trying to scrape Craigslist with Mechanize. I wrote this:

require 'mechanize'

a = Mechanize.new
page = a.get("http://paris.craigslist.fr/search/apa")
i = 0
list_per_page = 99
while i <= list_per_page do
  title = page.search(".hdrlnk")[i].text
  price = page.search(".price")[i].text
  puts title
  puts price
  puts "-----------"
  i += 1
end


It works, but when a listing doesn't have a price there is an interval. I think it's because I use search()[i], but I don't know what to do to avoid the interval. Any ideas?

Edit:

On Craigslist there is:

listing_title1 -> $100
listing_title2 -> $200
listing_title3 ->
listing_title4 -> $60
listing_title5 -> $150


My output CSV displays:

listing_title1 -> $100
listing_title2 -> $200
listing_title3 -> $60
listing_title4 -> $150
listing_title5 -> $300


$300 is the price of listing_title6.

Answer

If by 'interval' you mean the blank line that is printed when the listing doesn't have a price, you could fix this by making the puts conditional:

puts price unless price.empty?

Edit

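If I understand right, your hdrlnk and price entries are getting out of sync with each other. This happens because page.search(".price") returns only the price elements that actually exist on the page. A listing without a price contributes nothing to that list, so from that point on every price at index i belongs to a later listing than the title at index i.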
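
To see the drift in isolation, here is a minimal sketch that uses plain arrays as stand-ins for the two search results. The values are the sample ones from the question's edit, not real scraped data, and the prices array is one element shorter because listing_title3 has no price element.

```ruby
# Stand-ins for page.search(".hdrlnk") and page.search(".price").
# listing_title3 has no price, so the prices list is one element shorter.
titles = %w[listing_title1 listing_title2 listing_title3 listing_title4 listing_title5]
prices = %w[$100 $200 $60 $150]

# Pairing by index, as the loop in the question does, attaches every
# price after the gap to the wrong title.
pairs = titles.each_index.map { |i| [titles[i], prices[i]] }

pairs.each { |title, price| puts "#{title} -> #{price}" }
```

Here listing_title3 is paired with $60, which belongs to listing_title4. On the real page the last title would pick up the price of the next listing instead of nil, which matches the $300 in the question.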

The best way to get around this is to find a container that includes both price and hdrlnk and iterate over those containers instead of over the hdrlnk and price entries separately. On this page that would be the .row element, which contains all the info for each search result. So something like this would work:

page.search(".row").each do |row|
  title = row.search(".hdrlnk").first
  price = row.search(".price").first
  puts title.text if title
  puts price.text if price
  puts "------------"
end