LewlSauce - 27 days ago
Ruby Question

XPath to select all nodes between two text markers in OOXML?

I have a large XML file (from Microsoft Word) that contains tables, paragraphs, etc. I'm trying to grab all of the XML between two elements. For example, I want everything between these two paragraphs:

<w:p w:rsidR="00C82C88" w:rsidRDefault="00265695">
<w:r>
<w:t>#StartHere#</w:t>
</w:r>
</w:p>
a whole bunch of XML
<w:p w:rsidR="00C82C88" w:rsidRDefault="00265695" w:rsidP="00265695">
<w:pPr>
<w:pStyle w:val="Caption"/>
</w:pPr>
<w:r>
<w:t xml:space="preserve">Figure </w:t>
</w:r>
<w:r w:rsidR="00F044F8">
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r w:rsidR="00F044F8">
<w:instrText xml:space="preserve"> SEQ Figure \* ARABIC </w:instrText>
</w:r>
<w:r w:rsidR="00F044F8">
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r>
<w:rPr>
<w:noProof/>
</w:rPr>
<w:t>1</w:t>
</w:r>
<w:r w:rsidR="00F044F8">
<w:rPr>
<w:noProof/>
</w:rPr>
<w:fldChar w:fldCharType="end"/>
</w:r>
<w:r>
<w:t>: #StopHere#</w:t>
</w:r>
</w:p>


How can I get Nokogiri to grab all of the XML between #StartHere# and #StopHere#, including the elements that wrap this text? I'd like to call something like
extracted_data = document[from..stop]
somehow.

I can find those points in the document by looking for:

start = doc.at_xpath("//w:p[.//w:t[contains(., '#StartHere#')]]")
stop = doc.at_xpath("//w:p[.//w:t[contains(., '#StopHere#')]]")


but I still need to figure out how to express document[start..stop] so that it grabs those two paragraphs and everything between them.

Answer

This XPath,

//node()[    preceding::w:p[w:r/w:t[.='#StartHere#']] 
         and following::w:p[w:r/w:t[.=': #StopHere#']]]

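will select every node, at any depth, that lies between the two paragraphs containing your marker text. The marker paragraphs themselves are not selected, and the comparisons are exact matches, which is why the stop marker is written as ': #StopHere#'.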

In Nokogiri: doc.xpath("insert above XPath here")