max pleaner - 4 months ago
Ruby Question

Parsing a simple XML-like string with adjacent nodes

I'm using a gem to classify a sentence according to its parts of speech. The output I get is as follows:

puts text
# => "<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>"

I would have expected the gem to give me an array, but I guess I'll have to coerce this into an array myself.

What I'm eventually trying to get is a nested array something like this:

[["My", "nnp"], ["name", "nn"], ["is", "vbz"], ["Max", "nnp"]]

However, I'm not really sure how to approach this with Nokogiri (or another parser library). Here's what I've tried:

(byebug) doc = Nokogiri::XML(text)
#<Nokogiri::XML::Document:0x3fd400286e78 name="document" children=[#<Nokogiri::XML::Element:0x3fd400286900 name="nnp" children=[#<Nokogiri::XML::Text:0x3fd400286464 "My">]>]>
(byebug) Nokogiri.parse(text)
#<Nokogiri::XML::Document:0x3fd40028cd50 name="document" children=[#<Nokogiri::XML::Element:0x3fd40028c7d8 name="nnp" children=[#<Nokogiri::XML::Text:0x3fd40028c378 "My">]>]>

So I've tried two different Nokogiri methods, but both are only showing the first node. How can I get the rest of the adjacent nodes as well?

Alternatively, how can I get the gem's call to return an array directly? In the docs, I didn't find an example of how to return an array with all tags, only arrays with one specific kind of tag.


The main thing is that well-formed XML must have a single root node. You were getting only the first node because it was treated as the root (that is, the topmost) node, and once it was closed, Nokogiri considered the XML document to be finished. The fix is to wrap the string in a root element and read that element's children.

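  Nokogiri::XML("<root>#{text}</root>").   # wrap in a root element (the name is arbitrary)
    children.first.                        # get root node
    children.map { |e| [e.text, e.name] }. # map to what's needed
    reject { |e| e.last == 'text' }        # filter out garbage (whitespace text nodes)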

Filtering by node type rather than by node name might be more semantically correct:

  Nokogiri::XML("<root>#{text}</root>").
    children.first.
    children.reject { |e| Nokogiri::XML::Text === e }. # drop whitespace text nodes
    map { |e| [e.text, e.name] }
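
If pulling in an XML parser feels heavy for input this simple, a plain `String#scan` can produce the same nested array. This is a different technique from the Nokogiri approach, and only a minimal sketch: `tagged_pairs` and `TAGGED_WORD` are hypothetical names, and the pattern assumes flat, non-nested `<tag>word</tag> ` pairs with no attributes, as in the sample string from the question.

```ruby
# Hypothetical helper, not part of Nokogiri or the tagging gem.
# Assumes flat, non-nested <tag>word</tag> pairs without attributes.
TAGGED_WORD = %r{<([^<>/\s]+)>([^<]*)</\1>}

def tagged_pairs(text)
  # scan yields [tag, word] for each match; swap to [word, tag]
  text.scan(TAGGED_WORD).map { |tag, word| [word, tag] }
end

text = "<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>"
p tagged_pairs(text)
# => [["My", "nnp"], ["name", "nn"], ["is", "vbz"], ["Max", "nnp"]]
```

The backreference `\1` makes sure the closing tag matches the opening one. For anything nested or carrying attributes, the Nokogiri version is the safer choice.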