MOZILLA MOZILLA - 2 months ago 6
Ruby Question

Parsing an HTML table using Hpricot (Ruby)

I am trying to parse an HTML table using Hpricot but am stuck, not able to select a table element from the page which has a specified id.

Here is my ruby code:-

require 'rubygems'
require 'mechanize'
require 'hpricot'

agent = WWW::Mechanize.new

page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')

form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])

doc = Hpricot(page.body)

puts doc.to_html # Here the doc contains the full HTML page

puts doc.search("//table[@id='gvw_offices']").first # This is NIL


Can anyone help me to identify what's wrong with this.

Answer

Mechanize will use hpricot internally (it's mechanize's default parser). What's more, it'll pass the hpricot stuff on to the parser, so you don't have to do it yourself:

require 'rubygems'
require 'mechanize'

#You don't really need this if you don't use hpricot directly
require 'hpricot'

agent = WWW::Mechanize.new

page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')

form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])

puts page.parser.to_html # page.parser returns the hpricot parser

puts page.at("//table[@id='gvw_offices']") # This passes through to hpricot

Also note that page.search("foo").first is equivalent to page.at("foo").