CodingMo CodingMo - 18 days ago 11
Ruby Question

How to avoid creating non-significant white space text nodes when creating a `Nokogiri::XML` or `Nokogiri::HTML` object

I'll start off by explaining the motivation behind this. First of all, I would like to count the `children` of each node in a Nokogiri XML `Document` and access the first or last child that is not a non-significant white space.

A non-significant white space text node is usually created when you are parsing an indented XML representation. For example, if you have the following XML:

<note>
 <to>Tove</to>
 <from>Jani</from>
 <heading>Reminder</heading>
 <body>Don't forget me this weekend!</body>
</note>


Whose string representation is as follows:

"<note>\n <to>Tove</to>\n <from>Jani</from>\n <heading>Reminder</heading>\n <body>Don't forget me this weekend!</body>\n</note>\n"


When creating a Nokogiri::XML object from this string representation, you get the following `Document` created:

#(Document:0x3fc07e4540d8 {
  name = "document",
  children = [
    #(Element:0x3fc07ec8629c {
      name = "note",
      children = [
        #(Text "\n "),
        #(Element:0x3fc07ec8089c {
          name = "to",
          children = [ #(Text "Tove")]
        }),
        #(Text "\n "),
        #(Element:0x3fc07e8d8064 {
          name = "from",
          children = [ #(Text "Jani")]
        }),
        #(Text "\n "),
        #(Element:0x3fc07e8d588c {
          name = "heading",
          children = [ #(Text "Reminder")]
        }),
        #(Text "\n "),
        #(Element:0x3fc07e8cf590 {
          name = "body",
          children = [ #(Text "Don't forget me this weekend!")]
        }),
        #(Text "\n")]
    })]
})


Here, you get lots of white space nodes of type `Nokogiri::XML::Text`. Those are the nodes that I wish not to be parsed. Failing that, I would like a way to distinguish them from the significant ones, such as the text node that represents the string "Tove".




Just to be clear, I do not want the whitespace between a closing and an opening tag represented. Or, if I cannot avoid that, I want to distinguish those nodes from significant text nodes like the text node inside the element `<to>`.

Here is an example rspec of something that I am looking for:

require 'nokogiri'
require_relative 'spec_helper'

xml_text = <<XML
<note>
 <to>Tove</to>
 <from>Jani</from>
 <heading>Reminder</heading>
 <body>Don't forget me this weekend!</body>
</note>
XML

xml = Nokogiri::XML(xml_text)

def significant_nodes(node)
  return 0
end

describe "Stackoverflow Question" do
  it "should return the number of significant nodes in nokogiri." do
    expect(significant_nodes(xml.css('note'))).to eq 4
  end
end


So essentially, I want to know how to create the `significant_nodes` function.

If I change the XML to:

<note>
 <to>Tove</to>
 <from>Jani</from>
 <heading>Reminder</heading>
 <body>Don't forget me this weekend!</body>
 <footer></footer>
</note>


Then, when I create the Document, I would still like the footer represented, so using `config.noblanks` is not an option.

Answer

You can use the NOBLANKS option when parsing the XML string. Consider this example:

require 'nokogiri'

string = "<foo>\n  <bar>bar</bar>\n</foo>"
puts string
# <foo>
#   <bar>bar</bar>
# </foo>

document_with_blanks = Nokogiri::XML.parse(string)

document_without_blanks = Nokogiri::XML.parse(string) do |config|
  config.noblanks
end

document_with_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Text:0x3ffa4e153dac "\n  ">
#<Nokogiri::XML::Element:0x3fdce3f78488 name="bar" children=[#<Nokogiri::XML::Text:0x3fdce3f781f4 "bar">]>
#<Nokogiri::XML::Text:0x3ffa4e15335c "\n">

document_without_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3f81bef42034 name="bar" children=[#<Nokogiri::XML::Text:0x3f81bef43ee8 "bar">]>

The NOBLANKS option does not remove empty elements:

doc = Nokogiri.XML('<foo><bar></bar></foo>') do |config|
  config.noblanks
end

doc.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3fad0fafbfa8 name="bar">

As the OP pointed out, the documentation on the Nokogiri website (and also on the libxml website) about the parser options is quite cryptic. The following spec pins down the behaviour of the NOBLANKS option:

require 'rspec/autorun'
require 'nokogiri'

def parse_xml(xml_string)
  Nokogiri.XML(xml_string) { |config| config.noblanks }
end

describe "Nokogiri NOBLANKS parser option" do

  it "removes whitespace nodes if they have siblings" do
    doc = parse_xml("<root>\n <child></child></root>")
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Element)
  end

  it "doesn't remove whitespace nodes if they have no siblings" do
    doc = parse_xml("<root>\n </root>")
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Text)
  end

  it "doesn't remove empty nodes" do
    doc = parse_xml('<root><child></child></root>')
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Element)
  end

end