TenJack - 5 months ago
Ruby Question

Nokogiri not parsing exported bookmark html from Delicious correctly

I cannot figure out why Nokogiri is not parsing this HTML file correctly. The file is a bookmark export from Delicious. It has 400 links in it, but Nokogiri always finds only 254. I have other Delicious export files, with differing numbers of links, that also yield only 254 links, and one (with over 2000 links) that parses correctly, so it seems specific links may be causing the issue, but I'm really not sure. I'm linking to the HTML here, since including it puts the body of this post over the character limit. This is the HTML:


And this is the HTML output as a string:


I'm uploading the HTML file with the Carrierwave gem and parsing it. The code I've been using is below (where html_upload is a model instance using Carrierwave):

doc = Nokogiri::HTML.parse html_upload.file.read
puts doc.css('a').count


When Nokogiri does not parse a document as you'd expect, always check doc.errors.

Here's what I get when I try to parse the raw content from your gist:

require 'nokogiri'
doc = Nokogiri.HTML(DATA.read)
puts doc.errors.last
#=> Excessive depth in document: 256 use XML_PARSE_HUGE option

The problem here is that the HTML file has tons of unclosed tags (mostly <DT>), which Nokogiri (or rather, libxml2) tries to nest within one another until it hits its depth limit of 256 and stops parsing.

You can tell Nokogiri to forge on using the 'huge' config option (here &:huge is shorthand for the block { |config| config.huge }):

doc = Nokogiri.HTML( myhtml, &:huge )

Personally, I'd just lightly fix up the HTML in question using gsub, closing each <DT> at the end of its line:

html = DATA.read
# Append </DT> after each link so libxml2 no longer nests the <DT> elements
html.gsub!(/<DT>.+?<\/A>$/, '\\0</DT>')
doc = Nokogiri.HTML(html)
p doc.css('a').length
#=> 399

(I checked: there are only 399 links in the file, not 400.)
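
To see what the gsub fix-up does without pulling in the full export, here is a minimal sketch that uses only core Ruby. The sample markup, the example.com URLs, and the close_dt_tags helper name are all made up for illustration; they only mimic the shape of a Delicious export, where every <DT> is left unclosed.

```ruby
# Hypothetical sample mimicking a Delicious export: no <DT> is ever closed.
SAMPLE_HTML = <<~HTML
  <DL><p>
  <DT><A HREF="http://example.com/one">One</A>
  <DT><A HREF="http://example.com/two">Two</A>
  <DD>A note attached to the second bookmark
  <DT><A HREF="http://example.com/three">Three</A>
  </DL><p>
HTML

# Append </DT> to every span that starts at <DT> and ends with a
# line-final </A>. In Ruby regexes, $ matches at the end of each line,
# and \0 in the replacement stands for the whole match.
def close_dt_tags(html)
  html.gsub(/<DT>.+?<\/A>$/, '\0</DT>')
end

fixed = close_dt_tags(SAMPLE_HTML)
puts fixed
puts "opened: #{fixed.scan('<DT>').length}, closed: #{fixed.scan('</DT>').length}"
```

Note that the pattern only touches lines that end in </A>, so a <DT> whose link is followed by other text on the same line would be left unclosed. That was fine for this export, but it is worth checking on other files.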