nevan king - 7 months ago
Ruby Question

Is there a way to get the raw HTML from Nokogiri?

I've seen the question "How to get the raw HTML source code for a page by using Ruby or Nokogiri?", which uses something like this:

file = open("index.html")
puts file.read
page = Nokogiri::HTML(file)


But reading the file seems to move the read position to the end of the file, so Nokogiri can't read the file anymore. If I swap the read and the Nokogiri call:

file = open("index.html")
page = Nokogiri::HTML(file)
puts file.read


The file is no longer output. I'd like to be able to query Nokogiri for the HTML it used originally, so that I can do my own extra parsing on the raw source. Ideally, I'd like something like

file = open("index.html")
page = Nokogiri::HTML(file)
raw_html = page.html


Note: I've also tried page.to_html, but it seems to change the formatting slightly.

Answer

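You usually pass a File instance so the input can be processed in chunks, but passing a string also works. If you read the file into a string first, the raw HTML stays available in that string next to the parsed document:

require "nokogiri"

html = File.read("index.html")  # raw source, exactly as stored on disk
page = Nokogiri::HTML(html)     # parsed document for querying
raw_html = html                 # the unmodified markup is still available here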
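
If you would rather keep working with a File handle, the read position can be moved back to the start with rewind between reads. The following is only a minimal sketch: a temporary file stands in for index.html, and the sample markup and variable names are made up for illustration. Nokogiri is left out so the snippet runs on its own; after the rewind, the same handle could presumably be passed to Nokogiri::HTML as in the question.

```ruby
require "tempfile"

# Hypothetical stand-in for index.html so the sketch is self-contained.
sample = Tempfile.new(["index", ".html"])
sample.write("<html><body><p>Hello</p></body></html>\n")
sample.close

file = File.open(sample.path)

raw_html = file.read      # consumes the stream; the position is now at the end
at_eof = file.eof?        # true: nothing is left for a second reader
second_read = file.read   # "" because the position was not reset

file.rewind               # move the read position back to the start
reread = file.read        # the full contents are available again

file.close
sample.unlink
```

This shows why the order of the two calls in the question matters: whichever call reads first leaves the handle at the end of the file, and the second one gets nothing unless the handle is rewound in between.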