Ajay Ajay - 27 days ago 8
Ruby Question

How to parse HTML contents of a page using Nokogiri

require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(open url)


I am trying to fetch the basic set of information like:

event_name
categories
sponsor
venue
event_location
cost


For example, for event_name I have this XPath:

"/html/body/div[2]/div[2]/div[1]/h3/a/span"


And use it like:

puts doc.xpath "/html/body/div[2]/div[2]/div[1]/h3/a/span"


This returns nil for event_name.

If I save the URL contents locally, the above XPath works.

I also need the other fields listed above. I tried other XPaths too, but the results come back blank.

Answer

The provided link contains XML, so your XPath expressions should match the XML structure of the feed, not the HTML structure of a rendered page.

The key point is that the document uses namespaces, so every XPath expression has to take them into account and qualify the element names accordingly.
To simplify the XPath expressions, you can strip the namespaces with the remove_namespaces! method:

require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
# URI.open is used because Kernel#open no longer handles URLs on Ruby 3.0+
doc = Nokogiri::XML(URI.open(url)); nil # nil is used to avoid huge output

doc.remove_namespaces!; nil
event = doc.xpath('//feed/entry[1]') # it will give you the first event

event.xpath('./title').text # => "Conservation Clinics"
event.xpath('./categories').text # => "Demonstrations,Lectures & Discussions"

Most likely you will want an array of hashes, one per event.
You can build it like this:

doc.xpath('//feed/entry').map do |event|
  {
    title: event.xpath('./title').text,
    categories: event.xpath('./categories').text
    # all other attributes you need ...
  }
end

It will give you an array like:

[
  {:title=>"Conservation Clinics", :categories=>"Demonstrations,Lectures & Discussions"}, 
  {:title=>"Castle Highlights Tour", :categories=>"Gallery Talks & Tours"}, 
  ...
]