Terry Li - 1 year ago
HTML Question

Failing to extract html table rows

I'm trying to extract all five rows of a table on the page fetched in the code below.

I'm using the Ruby Hpricot library to extract the table rows with an XPath expression.

In my example, the XPath expression is /html/body/center/table/tr. Note that I've removed the tbody tag from the expression, which is usually what it takes for the extraction to succeed.

The weird thing is that the result contains only the first three rows; the last two are missing. I have no idea what's going on there.

EDIT: Nothing magic about the code, just attaching it upon request.

require 'open-uri'
require 'hpricot'

faculty = Hpricot(open("http://www.utm.utoronto.ca/7800.0.html"))
(faculty/"/html/body/center/table/tr").each do |text|
  puts text.to_s
end

Answer

The HTML document in question is invalid. (See http://validator.w3.org/check?uri=http%3A%2F%2Fwww.utm.utoronto.ca%2F7800.0.html.) Hpricot parses it differently from your browser, hence the different results, but it can't really be blamed. Until HTML5, there was no standard for how to parse invalid HTML documents.

I tried replacing Hpricot with Nokogiri and it seems to give the expected parse. Code:

require 'open-uri'
require 'nokogiri'

# On Ruby 3.0 and later, call URI.open here; Kernel#open no longer fetches URLs.
faculty = Nokogiri::HTML(open("http://www.utm.utoronto.ca/7800.0.html"))

faculty.search("/html/body/center/table/tr").each do |text|
  puts text
end

Maybe you should switch?
