steph steph - 9 days ago 6
Ruby Question

Extracting RSS link with Nokogiri

I am using Nokogiri to extract the RSS link from a webpage. However, some websites use absolute paths in their HTML and others use relative ones, so I want any relative path to be converted to an absolute one.

Here is my code:

require 'nokogiri'
require 'simple-rss'
require 'open-uri'


# Prefix each command-line argument with the scheme.
ARGV.map! { |http| "http://#{http}" }

ARGV.each do |website|
  # URI.open is the open-uri entry point on current Ruby versions.
  doc = Nokogiri::HTML(URI.open(website))
  rss_path = doc.xpath("//link[@type=\"application/rss+xml\"]").map do |link|
    if link['href'] =~ /^http:\/\/[a-z]*\..*\//i
      puts link['href']
    else
      puts "#{website}#{link['href']}"
    end
  end
end


From the command line, I would run something like:

ruby rss.rb 8gramgorilla.com rubyweekly.com


The code works fine for rubyweekly.com, which has a relative path for its RSS link. But 8gramgorilla.com has an absolute path, so I want it to be output as is, not as http://8gramgorilla.com/http://8gramgorilla.com/feed. It looks as if the if statement is being ignored and execution goes straight to the else branch.

Answer

The if statement isn’t being ignored; its condition is evaluating to false. Your regexp is /^http:\/\/[a-z]*\..*\//i, so it looks for http:// followed by any number of letters a-z (including none) and then a literal dot. But the feed URL is http://8gramgorilla.com/feed: the first character after http:// is the digit 8, which isn’t in the range a-z, so the match fails and the else branch runs.

The most direct fix to this would be to change your regex to include digits, perhaps something like /^http:\/\/[\da-z]*\..*\//i (where \d has been added).
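As a quick check of that change, the sketch below compares the two patterns on the absolute href from the question. The method names and the relative href "/rss" are made up for illustration; only the two regexes come from the discussion above.

```ruby
# Hypothetical helper names; the patterns are the ones discussed above.
def matches_original?(href)
  href.match?(/^http:\/\/[a-z]*\..*\//i)
end

def matches_with_digits?(href)
  href.match?(/^http:\/\/[\da-z]*\..*\//i)
end

absolute = "http://8gramgorilla.com/feed"

puts matches_original?(absolute)     # false: "8" is not in a-z
puts matches_with_digits?(absolute)  # true: \d now accepts the leading digit
puts matches_with_digits?("/rss")    # false: a relative href still falls through to else
```

Note that the widened pattern is still fairly narrow: a host containing a hyphen, or an https:// link, would not match it.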

You might be able to simplify the regex further. Simply checking whether the URL starts with http:// may be enough.
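A minimal sketch of that simpler check might look like the following. The method name absolute_href? is hypothetical, and also accepting https:// is an assumption that goes slightly beyond the http:// case mentioned above.

```ruby
# Hypothetical helper: treats an href as absolute if it starts with a scheme.
# Accepting https as well as http is an assumption, not part of the question.
def absolute_href?(href)
  href.match?(%r{\Ahttps?://}i)
end

puts absolute_href?("http://8gramgorilla.com/feed")  # true
puts absolute_href?("/rss")                          # false
```

Inside the loop, the condition would then read `if absolute_href?(link['href'])`, with the two branches unchanged.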

A more robust solution would be to properly parse the URL in question, perhaps using the Addressable gem or the URI module in Ruby’s standard library.
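With the standard library, one way to do this is URI.join, which resolves an href against the page URL and leaves an already absolute href untouched, so no if/else is needed. This is only a sketch: the method name absolute_feed_url and the relative href "/rss" are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: resolves href against the page it was found on.
# URI.join returns href unchanged when it is already an absolute URL.
def absolute_feed_url(page_url, href)
  URI.join(page_url, href).to_s
end

puts absolute_feed_url("http://rubyweekly.com", "/rss")
# http://rubyweekly.com/rss
puts absolute_feed_url("http://8gramgorilla.com", "http://8gramgorilla.com/feed")
# http://8gramgorilla.com/feed
```

In the original loop, this would replace the whole if/else with `puts absolute_feed_url(website, link['href'])`.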
