TheCha͢mp - 6 months ago
Ruby Question

How do I remove white space between HTML nodes?

I'm trying to remove the whitespace between the <p> tags of an HTML fragment:

<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>


As you can see, there is always a blank space between a closing </p> tag and the next opening <p> tag.

The problem is that these blank spaces turn into <br> tags when I save the string to my database. Methods like strip or gsub only remove the whitespace inside the nodes, resulting in:

<p>FooBar</p> <p>barbarbar</p> <p>bla</p>


whereas I'd like to have:

<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>


I'm using:


  • Nokogiri 1.5.6

  • Ruby 1.9.3

  • Rails



UPDATE:



Occasionally the <p> tags have child nodes that cause the same problem: whitespace between the nodes.

Sample Code

Note: the code is normally on one line; I reformatted it because it would be unreadable otherwise.

<p>
  <p>
    <strong>Selling an Appartment</strong>
  </p>
  <ul>
    <li>
      <p>beautiful apartment!</p>
    </li>
    <li>
      <p>near the train station</p>
    </li>
    .
    .
    .
  </ul>
  <ul>
    <li>
      <p>10 minutes away from a shopping mall </p>
    </li>
    <li>
      <p>nice view</p>
    </li>
  </ul>
  .
  .
  .
</p>


How would I strip that whitespace as well?

SOLUTION



It turns out that I messed up using the gsub method and didn't investigate the possibility of using gsub with a regex.

The simple solution was adding

data = data.gsub(/>\s+</, "><")


It deleted whitespace between all different kinds of nodes... Regex ftw!
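
As a quick check, here is a minimal sketch of that one-liner applied to a nested fragment like the one in the update. The sample string and the variable names html and compact are made up for illustration; only the gsub call and its regex come from the solution above.

```ruby
# Hypothetical sample input, modelled on the nested markup in the update.
html = <<HTML
<ul>
  <li>
    <p>beautiful apartment!</p>
  </li>
  <li>
    <p>near the train station</p>
  </li>
</ul>
HTML

# Replace every run of whitespace that sits directly between ">" and "<".
# Whitespace inside text content ("beautiful apartment!") is left alone.
# The final strip drops the trailing newline left by the heredoc.
compact = html.gsub(/>\s+</, "><").strip

puts compact
# prints: <ul><li><p>beautiful apartment!</p></li><li><p>near the train station</p></li></ul>
```

Note that the regex only looks at the characters around it, so it treats every pair of adjacent tags the same way, whether they are block-level or inline.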

Answer

This is how I'd write the code:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>Foo Bar</p> <p>bar bar bar</p> <p>bla</p>
EOT

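# For every <p>, <ul> and <li> node, drop the sibling that follows it
# when that sibling contains nothing but whitespace.
doc.search('p, ul, li').each do |node|
  next_node = node.next_sibling
  next_node.remove if next_node && next_node.text.strip == ''
end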

puts doc.to_html

It results in:

<p>Foo Bar</p><p>bar bar bar</p><p>bla</p>

Breaking it down:

doc.search('p, ul, li')

looks for only the <p>, <ul> and <li> nodes in the document. Nokogiri returns a NodeSet from search, which is empty if nothing matched. The code loops over the NodeSet, looking at each node in turn.

next_node = node.next_sibling

gets the node that follows the current node, or nil if there isn't one.

next_node.remove if next_node && next_node.text.strip == ''

next_node.remove removes next_node from the DOM if it isn't nil and its text is empty when stripped; in other words, if the node contains only whitespace.

There are other techniques to locate only the TextNodes if all of them should be stripped from the document. That's risky, because it can end up deleting all blanks between tags, causing run-on sentences and joined words, which probably isn't what you want.
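
To make that risk concrete, here is a small sketch in plain Ruby. It uses the regex approach rather than Nokogiri TextNodes, purely so that it runs without any gem; the joined-words effect is the same. The helper name squeeze_between_blocks and the BLOCK_TAGS list are assumptions made up for this illustration, and a regex like this is fragile on real-world HTML (for example, attribute values that contain ">").

```ruby
# Assumption: the tags treated as block-level; extend the list as needed.
BLOCK_TAGS = %w[p ul li].freeze

# Matches one opening or closing block-level tag, with optional attributes.
BLOCK_TAG = /<\/?(?:#{BLOCK_TAGS.join('|')})\b[^>]*>/

# Hypothetical helper: removes whitespace only where it sits between
# two block-level tags, leaving spaces around inline tags untouched.
def squeeze_between_blocks(html)
  html.gsub(/(#{BLOCK_TAG})\s+(?=#{BLOCK_TAG})/, '\1')
end

html = "<p><b>Foo</b> <i>bar</i></p> <p>bla</p>"

naive   = html.gsub(/>\s+</, "><")     # strips between ALL adjacent tags
careful = squeeze_between_blocks(html) # strips between block tags only

puts naive    # <p><b>Foo</b><i>bar</i></p><p>bla</p>   (renders as "Foobar")
puts careful  # <p><b>Foo</b> <i>bar</i></p><p>bla</p>
```

The lookahead keeps the second tag out of the match, so a run such as </li> </ul> <p> is handled one gap at a time.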