steph steph - 2 months ago 39
Ruby Question

Extracting RSS link with Nokogiri

I am using Nokogiri to extract the RSS link from a webpage. However, some websites use absolute paths in their HTML and others use relative ones, so I want any relative path to be converted to an absolute one.

Here is my code:

require 'nokogiri'
require 'simple-rss'
require 'open-uri'

ARGV.map! { |http| "http://#{http}" }
ARGV.each do |website|
  doc = Nokogiri::HTML(open(website))
  rss_path = doc.xpath("//link[@type=\"application/rss+xml\"]").map do |link|
    if link['href'] =~ /^http:\/\/[a-z]*\..*\//i
      puts link['href']
    else
      puts "#{website}#{link['href']}"
    end
  end
end

So on the command line, I would type something like:

ruby rss.rb

The code works fine for a site that has a relative path for its RSS link, but another site has an absolute path, so I want that path to be output as-is rather than with the website URL prepended. Basically, the if statement appears to be ignored and execution goes straight to the else branch.


The if statement isn’t being ignored; it is evaluating to false. Your regexp is /^http:\/\/[a-z]*\..*\//i, so it looks for http:// followed by any number of letters a-z (possibly none) and then a literal dot. But the first character of the website’s host name is the digit 8, which isn’t in the range a-z, so the match fails and the else branch runs.

The most direct fix to this would be to change your regex to include digits, perhaps something like /^http:\/\/[\da-z]*\..*\//i (where \d has been added).
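
A quick way to see the difference between the two patterns is sketched below. The URL is a hypothetical stand-in for a site whose host name starts with a digit, and the constant names are made up for illustration.

```ruby
# Hypothetical feed URL whose host name begins with a digit.
SAMPLE_URL = 'http://8example.com/feed/'

# The pattern from the question: only letters are allowed before the first dot.
ORIGINAL_PATTERN = /^http:\/\/[a-z]*\..*\//i

# The same pattern with \d added to the character class.
WITH_DIGITS = /^http:\/\/[\da-z]*\..*\//i

# =~ returns the index of the match, or nil when there is none.
puts (SAMPLE_URL =~ ORIGINAL_PATTERN).inspect
puts (SAMPLE_URL =~ WITH_DIGITS).inspect
```

The first check prints nil and the second prints 0, because the match now starts at the beginning of the string.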

You might be able to simplify the regex further. Simply checking whether the url starts with http:// may be enough.
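
A minimal sketch of that simpler check follows. The helper name absolute_url? is hypothetical, and accepting https:// as well as http:// is an assumption that goes beyond the original regex.

```ruby
# Treat any href that starts with a scheme as absolute.
# Allowing https is an assumption; the question only mentions http.
def absolute_url?(href)
  href.match?(%r{\Ahttps?://}i)
end

puts absolute_url?('http://8example.com/feed/')  # absolute: scheme present
puts absolute_url?('/feed.xml')                  # relative: no scheme
```

In the loop from the question, this helper could stand in for the regex in the if condition.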

A more robust solution would be to properly parse the url in question, perhaps using the Addressable gem or the URI module in Ruby’s standard lib.
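
As a rough sketch of the standard-library route, URI.join resolves an href against the page URL and returns an absolute href unchanged, so the if/else is no longer needed. The helper name absolute_feed_url and the example URLs are hypothetical.

```ruby
require 'uri'

# Resolve href against the URL of the page it was found on.
# An href that is already absolute comes back unchanged.
def absolute_feed_url(page_url, href)
  URI.join(page_url, href).to_s
end

puts absolute_feed_url('http://example.com/blog/', '/feed.xml')
puts absolute_feed_url('http://example.com/blog/', 'feed.xml')
puts absolute_feed_url('http://example.com/blog/', 'http://feeds.example.org/main.rss')
```

Note that URI.join raises URI::InvalidURIError when an href cannot be parsed, so real pages with malformed links may need a rescue around the call.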