UsefulUserName - 4 months ago
Perl Question

Perl: compare lines in an xml file

I've got an xml file which looks a bit like this:

<root>
  <project id="1">
    <element name="stuff" version="1.0"/>
    <element name="stuff" version="1.2"/>
    <element name="table" version="0.8"/>
  </project>
  <project id="2">
    <element name="fruit" version="1.0"/>
    <element name="tree" version="1.2"/>
    <element name="tree" version="0.8"/>
    <element name="tree" version="2.5"/>
  </project>
</root>


What I would like to do is delete all the elements with lower version numbers. So far I know how to read in the file and detect the lines which contain the elements:

open(my $in, '<', 'file.xml') or die "Cannot read file.xml: $!";
my @lines = <$in>;
close($in);
open(my $out, '>', 'file.xml') or die "Cannot write file.xml: $!";
foreach my $line (@lines) {
    if (index($line, '<element') != -1) {
        #only print newer versions here
    }
}
close($out);


But now I am not sure how to go on. I know I can compare version numbers like this:
version->parse($variable1) < version->parse($variable2)
but how can I compare two lines of the same file and then delete the one with the older version number?
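
For reference, the comparison mentioned above can be sketched with the core version module. This is only an illustration: the helper name is_older is hypothetical and does not come from the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use version;

# Hypothetical helper: true if the first version string is older than the second.
sub is_older {
    my ( $left, $right ) = @_;
    return version->parse($left) < version->parse($right) ? 1 : 0;
}

# Compare two of the version strings from the sample file.
print is_older( '0.8', '1.2' )
    ? "0.8 is older than 1.2\n"
    : "0.8 is not older than 1.2\n";
```

Unlike a plain string comparison, version->parse also handles dotted versions with more than two parts, such as '1.2.3' and '1.10.0'.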

Answer

Something like this will do it:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;

#set a handler for 'element' - this just collects the highest 'versions'.
my %version_of; 
sub get_highest_version {
    my ( $twig, $element ) = @_; 
    my $name = $element -> att('name'); 
    my $version = $element -> att('version'); 
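    #plain numeric comparison: fine for versions like '1.2', but dotted
    #versions such as '1.2.3' would need version->parse instead.
    if ( not defined $version_of{$name}
         or $version_of{$name} < $version ) {
        $version_of{$name} = $version;
    }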
}

#create a parser, set it to use the above handler for 'element' elements. 
my $twig = XML::Twig->new ( twig_handlers => { 'element' => \&get_highest_version } );
#parse the data (in __DATA__ below - you probably want to use 'parsefile' instead)
$twig -> parse( \*DATA );

#output for debug - see what the highest versions of each actually were. 
print Dumper \%version_of;

#iterate each of the 'element' nodes. 
foreach my $element ( $twig -> get_xpath ('//element') ) {
    #extract name/version from this element. 
    my $name = $element -> att('name'); 
    my $version = $element -> att('version');
    #delete this node unless it's the highest version.  
    $element -> delete unless $version >= $version_of{$name}; 
}

#set output indentation and print
$twig -> set_pretty_print('indented_a');
$twig -> print;


__DATA__
<root>
  <project id="1">
    <element name="stuff" version="1.0"/>
    <element name="stuff" version="1.2"/>
    <element name="table" version="0.8"/>
  </project>
  <project id="2">
    <element name="fruit" version="1.0"/>
    <element name="tree" version="1.2"/>
    <element name="tree" version="0.8"/>
    <element name="tree" version="2.5"/>
  </project>
</root>

Note, though, that this keeps duplicates if two elements share the same highest version. It also ignores the 'project' hierarchy entirely: it looks for the highest version of each name across the whole document. (You could scope it per project quite easily by tracking the project id, though.)
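
The per-project tracking mentioned above could be sketched roughly like this. It is a sketch under assumptions, not the code from the answer: the helper name prune_old_versions and the %best and %kept hashes are made up for illustration, every 'element' is assumed to sit inside a 'project' that has an 'id' attribute, and the numeric comparison is swapped for version->parse. It also drops repeats of an equal highest version, keeping only the first one.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use version;
use XML::Twig;

# Hypothetical helper: keep one highest-version 'element' per name
# within each project, and return the resulting XML as a string.
sub prune_old_versions {
    my ($xml) = @_;

    my $twig = XML::Twig->new( pretty_print => 'indented' );
    $twig->parse($xml);
    my @elements = $twig->get_xpath('//element');

    # First pass: record the highest version per project id and name.
    my %best;
    foreach my $element (@elements) {
        my $id      = $element->parent('project')->att('id');
        my $name    = $element->att('name');
        my $version = version->parse( $element->att('version') );
        my $seen    = $best{$id}{$name};
        $best{$id}{$name} = $version
            if !defined $seen or $seen < $version;
    }

    # Second pass: delete older versions, and repeats of a highest
    # version that has already been kept once.
    my %kept;
    foreach my $element (@elements) {
        my $id      = $element->parent('project')->att('id');
        my $name    = $element->att('name');
        my $is_best = version->parse( $element->att('version') ) == $best{$id}{$name};
        if ( !$is_best or $kept{$id}{$name}++ ) {
            $element->delete;
        }
    }

    return $twig->sprint;
}

# Usage with a cut-down version of the sample data.
my $sample = <<'XML';
<root>
  <project id="1">
    <element name="stuff" version="1.0"/>
    <element name="stuff" version="1.2"/>
  </project>
  <project id="2">
    <element name="stuff" version="0.5"/>
  </project>
</root>
XML
print prune_old_versions($sample);
```

Because the highest version is tracked per project id, the 'stuff' element in project 2 is kept even though project 1 holds a newer one.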
