Ducati007 - 1 month ago
Perl Question

Split a large XML file into smaller files, in chunks of child nodes, using a unix script

I could do the same thing in Java or C# with ease, but doing this in shell scripting involves a lot of learning, so any help is appreciated.

I have a huge XML file with child nodes such as url (let's say 100K nodes), and I need to split input.xml into subfiles of 10K nodes each, so that I get 10 files of 10K nodes each, with the parent tag (urlset) intact.



<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
<changefreq> Weekly </changefreq>
<priority> 0.8 </priority>
<lastmod> 2016-09-22 </lastmod>
</url>
</urlset>

Answer

The short answer is yes: this is totally doable.

XML::Twig supports "cut" and "paste" operations, as well as incremental parsing (for a lower memory footprint). Its distribution also ships an xml_split command-line tool, which may cover this use case directly (e.g. xml_split -g 10000 input.xml to group 10000 elements per output file).

So you'd do something like:

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

#new document. Manually set xmlns - could copy this from 'original'
#instead though. 
my $new_doc = XML::Twig->new;
$new_doc->set_root(
   XML::Twig::Elt->new(
      'urlset', { xmlns => "http://www.sitemaps.org/schemas/sitemap/0.9" }
   )
);
$new_doc->set_pretty_print('indented_a');

my $elt_count    = 0;
my $elts_per_doc = 2;    #2 for demonstration - set to 10_000 for the real task.
my $count_of_xml = 0;

#handle each 'url' element. 
sub handle_url {
   my ( $twig, $elt ) = @_;
   #if we've reached the per-file count, output this doc, close it,
   #then create a new one.
   if ( $elt_count >= $elts_per_doc ) {
      $elt_count = 0;
      open( my $output, '>', "new_xml_" . $count_of_xml++ . ".xml" )
        or warn $!;
      print {$output} $new_doc->sprint;
      close($output);
      $new_doc = XML::Twig->new();
      $new_doc->set_root(
         XML::Twig::Elt->new(
            'urlset',
            { xmlns => "http://www.sitemaps.org/schemas/sitemap/0.9" }
         )
      );
      $new_doc->set_pretty_print('indented_a');
   }

   #cut this element, paste it into new doc. 
   #note - this doesn't alter the original on disk - only the 'in memory' 
   #copy. 
   $elt->cut;
   $elt->paste( $new_doc->root );
   $elt_count++;
   #purge frees already-parsed (closed) elements from memory, which
   #keeps the footprint low on large inputs.
   $twig->purge;
}

#set a handler, start the parse.
my $twig = XML::Twig->new( twig_handlers => { 'url' => \&handle_url } );
$twig->parsefile('your_file.xml');

#flush the final (possibly partial) chunk - without this, any elements
#left over after the last full write would never reach disk.
if ($elt_count) {
   open( my $output, '>', "new_xml_" . $count_of_xml++ . ".xml" )
     or warn $!;
   print {$output} $new_doc->sprint;
   close($output);
}
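Since the question asked for a unix-script approach: when the input is formatted one tag per line, as in the sample above, a plain awk pass can do the split without an XML parser. This is only a sketch under that assumption (the input.xml fixture, the chunk size per=2, and the chunk_N.xml names are all illustrative); for arbitrarily formatted XML, a real parser like XML::Twig is the safer choice.

```shell
#!/bin/sh
# Sketch of a pure-awk split. ASSUMES one tag per line, as in the
# sample XML in the question; file names and chunk size are
# illustrative only.

# Illustrative input - three url elements.
cat > input.xml <<'EOF'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc> https://www.mywebsite.com/shopping </loc>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
</url>
<url>
<loc> https://www.mywebsite.com/shopping </loc>
</url>
</urlset>
EOF

rm -f chunk_*.xml

awk -v per=2 '
  /<url>/   { if (!out) {                      # start a new chunk file
                out = "chunk_" ++n ".xml"
                print "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">" > out
              }
              inurl = 1 }
  inurl     { print > out }                    # copy lines inside a url element
  /<\/url>/ { inurl = 0
              if (++cnt % per == 0) {          # chunk is full - close it
                print "</urlset>" > out
                close(out)
                out = "" } }
  END       { if (out) {                       # close a trailing partial chunk
                print "</urlset>" > out
                close(out) } }
' input.xml

ls chunk_*.xml
```

With per=2 and the three url elements in the sample input, this produces chunk_1.xml holding two elements and chunk_2.xml holding the remaining one, each wrapped in its own urlset.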