Ajay Ajay - 5 months ago 26
Ruby Question

How to parse HTML contents of a page using Nokogiri

require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(URI.open(url)) # URI.open is needed on Ruby 3+, where plain open no longer fetches URLs

I am trying to fetch the basic set of information like:


For example, for
I have this xpath:


And use it like:

puts doc.xpath "/html/body/div[2]/div[2]/div[1]/h3/a/span"

This returns nil for

If I save the URL contents locally, then the above XPath works.

Along with this, I need the above-mentioned information as well. I checked the other XPaths too, but the result turns out to be blank.


The provided link returns XML, not HTML, so your XPath expressions have to follow the XML structure of the feed rather than the structure of a rendered page.

The key thing is that the document has namespaces. As far as I understand, every XPath expression has to take that into account and specify the namespaces too.
In order to simplify the XPath expressions, one can use the remove_namespaces! method:

require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(URI.open(url)); nil # nil is used to avoid huge output; URI.open is needed on Ruby 3+

doc.remove_namespaces!; nil
event = doc.xpath('//feed/entry[1]') # it will give you the first event

event.xpath('./title').text # => "Conservation Clinics"
event.xpath('./categories').text # => "Demonstrations,Lectures & Discussions"

Most likely you would like to have an array of hashes, one per event.
You can build it like this:

doc.xpath('//feed/entry').reduce([]) do |memo, event|
  event_hash = {
    title: event.xpath('./title').text,
    categories: event.xpath('./categories').text
    # all other attributes you need ...
  }
  memo << event_hash # returns memo, which becomes the accumulator for the next entry
end

It will give you an array like:

[
  {:title=>"Conservation Clinics", :categories=>"Demonstrations,Lectures & Discussions"},
  {:title=>"Castle Highlights Tour", :categories=>"Gallery Talks & Tours"},
  ...
]