chell chell - 2 months ago 12
Ruby Question

How to search within a nodeset and delete a node from that same nodeset

I have the following xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document mc:Ignorable="w14 w15 wp14" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
<w:body>
<w:p w14:paraId="56037BEC" w14:textId="1188FA30" w:rsidR="001665B3" w:rsidRDefault="008B4AC6">
<w:r>
<w:t xml:space="preserve">This is the story of a man who </w:t>
</w:r>
<w:ins w:author="Mitchell Gould" w:date="2016-09-28T09:15:00Z" w:id="0">
<w:r w:rsidR="003566BF">
<w:t>went</w:t>
</w:r>
</w:ins>
<w:del w:author="Mitchell Gould" w:date="2016-09-28T09:15:00Z" w:id="1">
<w:r w:rsidDel="003566BF">
<w:delText>goes</w:delText>
</w:r>
</w:del>
...


I use Nokogiri to parse the xml as follows:

zip = Zip::File.open("test.docx")
doc = zip.find_entry("word/document.xml")
file = Nokogiri::XML.parse(doc.get_input_stream)


I have a 'deletions' nodeset that contains all of the w:del elements:

@deletions = file.xpath("//w:del")


I search inside of this nodeset to see if an element exists as follows:

my_node_set = @deletions.search("//w:del[@w:id='1']" && "//w:del/w:r[@w:rsidDel='003566BF']")


If it exists I want to remove it from the deletions nodeset. I do this with the following:

@deletions.delete(my_node_set.first)


This seems to work: no errors are raised, and the deleted node is displayed in the terminal.

However, when I check my @deletions nodeset it seems the item is still there:

@deletions.search("//w:del[@w:id='1']" && "//w:del/w:r[@w:rsidDel='003566BF']")


I'm just getting my head around Nokogiri so I'm obviously not searching for the element properly inside of my @deletions nodeset and am instead searching the entire document.

How can I search inside of the @deletions nodeset for the element and then delete it from the nodeset?

Answer

Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div id="foo"><p>foo</p></div>
    <div id="bar"><p>bar</p></div>
  </body>
</html>
EOT

divs contains the div tags, collected in a NodeSet:

divs = doc.css('div')
divs.class  # => Nokogiri::XML::NodeSet

And contains:

divs.to_html # => "<div id=\"foo\"><p>foo</p></div><div id=\"bar\"><p>bar</p></div>"

You can search a NodeSet using at to find the first match:

divs.at('#foo').to_html # => "<div id=\"foo\"><p>foo</p></div>"

And you can easily remove it:

divs.at('#foo').remove

This removes it from the document itself:

puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <body>
# >>     
# >>     <div id="bar"><p>bar</p></div>
# >>   </body>
# >> </html>

It doesn't delete it from the NodeSet, but we don't care about that: the NodeSet is just a list of references to nodes in the document, used here to say what to delete.

If you then want an updated NodeSet after deleting certain nodes, rescan the document and rebuild the NodeSet:

divs = doc.css('div')
divs.to_html # => "<div id=\"bar\"><p>bar</p></div>"

If your goal is to remove all the nodes in the NodeSet, instead of searching through that list you can simply use:

divs.remove
puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <body>
# >>     
# >>     
# >>   </body>
# >> </html>

When I'm deleting nodes I don't gather an intermediate NodeSet; instead I do it on the fly using something like:

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div id="foo"><p>foo</p></div>
    <div id="bar"><p>bar</p></div>
  </body>
</html>
EOT

doc.at('div#bar p').remove

puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <body>
# >>     <div id="foo"><p>foo</p></div>
# >>     <div id="bar"></div>
# >>   </body>
# >> </html>

This deletes the embedded <p> tag in #bar. By relaxing the selector and changing from at to search I can remove them en masse:

doc.search('div p').remove

puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <body>
# >>     <div id="foo"></div>
# >>     <div id="bar"></div>
# >>   </body>
# >> </html>

If you insist on walking through the NodeSet, remember that a NodeSet behaves much like an Array, and you can treat it as such. Here's an example of using reject to skip a particular node:

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div id="foo"><p>foo</p></div>
    <div id="bar"><p>bar</p></div>
  </body>
</html>
EOT

divs = doc.search('div').reject{ |d| d['id'] == 'foo' }
divs.map(&:to_html) # => ["<div id=\"bar\"><p>bar</p></div>"]

You won't receive a NodeSet, though; you'll get an Array:

divs.class # => Array

While you can do that, you're better off using a specific selector to reduce the set rather than relying on Ruby to select or reject elements.