Ruby Question

Parse/read Large XML file with minimal memory footprint

I have a very large XML file (300mb) of the following format:

<data>
  <point>
    <id><![CDATA[1371308]]></id>
    <time><![CDATA[15:36]]></time>
  </point>
  <point>
    <id><![CDATA[1371308]]></id>
    <time><![CDATA[15:36]]></time>
  </point>
  <point>
    <id><![CDATA[1371308]]></id>
    <time><![CDATA[15:36]]></time>
  </point>
</data>


Now I need to read it and iterate through the point nodes, doing something with each one. Currently I'm doing it with Nokogiri like this:

require 'nokogiri'

xmlfeed = Nokogiri::XML(open("large_file.xml"))
xmlfeed.xpath("./data/point").each do |item|
  save_id(item.xpath("./id").text)
end


However, that's not very efficient: it parses the whole file in one go and builds the entire DOM in memory, creating a huge memory footprint (several GB).

Is there a way to do this in chunks instead? Might be called streaming if I'm not mistaken?

EDIT

The suggested answer using Nokogiri's SAX parser might be okay, but it gets very messy when there are several nodes within each point that I need to extract content from and process differently. Instead of getting back a huge array of entries for later processing, I would much prefer to access one point at a time, process it, and then move on to the next, "forgetting" the previous one.
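For context, the SAX approach I mean looks roughly like this (just a sketch: the PointHandler class and its hash buffering are my illustration, and save_id stands in for whatever processing each point needs):

require 'nokogiri'

# The handler has to track parser state by hand, which is what gets
# messy once each <point> has several children to treat differently.
class PointHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    @point = {} if name == "point"
    @current = name
  end

  # The sample data wraps values in CDATA, which arrives via cdata_block
  # rather than characters, so funnel both into the same buffer.
  def characters(text)
    (@point[@current] ||= "") << text if @point && %w[id time].include?(@current)
  end
  alias cdata_block characters

  def end_element(name)
    if name == "point"
      save_id(@point["id"])   # process one point, then throw it away
      @point = nil
    end
    @current = nil
  end
end

Nokogiri::XML::SAX::Parser.new(PointHandler.new).parse(File.open("large_file.xml"))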

Answer

Given this little-known (but AWESOME) gist using Nokogiri's Reader interface, you should be able to do this:

# `inside_element`, `for_element`, and `inner_xml` come from the gist's
# Xml::Parser wrapper around Nokogiri::XML::Reader.
Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  inside_element 'point' do
    for_element 'id' do puts "ID: #{inner_xml}" end
    for_element 'time' do puts "Time: #{inner_xml}" end
  end
end

Someone should make this a gem, perhaps me ;)
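If you'd rather not pull in the gist's helpers, the same pattern works with Nokogiri::XML::Reader directly (a sketch; save_id is the method from the question, and only one <point> fragment is parsed at a time):

require 'nokogiri'

reader = Nokogiri::XML::Reader(File.open("large_file.xml"))

reader.each do |node|
  # Only react to the opening tag of each <point> element.
  next unless node.name == "point" &&
              node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

  # Parse just this fragment into a small throwaway document,
  # so the full file is never held in memory at once.
  point = Nokogiri::XML(node.outer_xml).at("point")
  save_id(point.at("id").text)
  # point.at("time").text is available the same way
end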
