Ajay Ajay - 25 days ago 6
Ruby Question

Unable to parse HTML contents of this page using Nokogiri

require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(open url)


I am trying to fetch a basic set of information for each event, such as
event_name, categories, sponsor, venue, event_location, and cost.


For example, for event_name I have got this xpath:
"/html/body/div[2]/div[2]/div[1]/h3/a/span"


puts doc.xpath "/html/body/div[2]/div[2]/div[1]/h3/a/span"


This returns nothing for the event name.

If I save the URL contents locally, the above XPath works.

Along with the event name, I need the other fields mentioned above as well.
I checked the other XPaths too, but the result is always blank.

Answer

The link you provided returns XML, not HTML, so your XPath expressions have to follow the XML structure rather than the HTML of the rendered page.
The key point is that the document uses namespaces. As far as I understand, every XPath expression has to take that into account and specify the namespaces too.
To simplify the XPath expressions, you can call the remove_namespaces! method first.

Try this code

require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
# URI.open is needed on Ruby 3.0+, where plain open no longer accepts URLs
doc = Nokogiri::XML(URI.open(url)); nil # nil is used to avoid huge output

doc.remove_namespaces!; nil
event = doc.xpath('//feed/entry[1]') # it will give you the first event

event.xpath('./title').text # => "Conservation Clinics"
event.xpath('./categories').text # => "Demonstrations,Lectures & Discussions"

Most likely you want an array of hashes, one per event.
You can build it like this:

doc.xpath('//feed/entry').map do |event|
  {
    title: event.xpath('./title').text,
    categories: event.xpath('./categories').text
    # all other attributes you need ...
  }
end

This gives you an array like:

[
  {:title=>"Conservation Clinics", :categories=>"Demonstrations,Lectures & Discussions"}, 
  {:title=>"Castle Highlights Tour", :categories=>"Gallery Talks & Tours"}, 
  ...
]