cajwine - 6 months ago
Perl Question

How to skip unwanted elements using XML::Twig?

Trying to learn XML::Twig and fetch some data from an XML document.

My XML contains 20k+ <ADN> elements. Each <ADN> element contains tens of child elements, one of which is <GID>. I want to process only those ADN where the GID == 1. (See the example XML in the __DATA__ section.)

The docs say:


Handlers are triggered in fixed order, sorted by their type (xpath
expressions first, then regexps, then level), then by whether they
specify a full path (starting at the root element) or not, then by
number of steps in the expression, then number of predicates, then
number of tests in predicates. Handlers where the last step does not
specify a step (foo/bar/*) are triggered after other XPath handlers.
Finally _all_ handlers are triggered last.

Important: once a handler has been triggered if it returns 0 then no
other handler is called, except a _all_ handler which will be called
anyway.


My actual code:

use 5.014;
use warnings;
use XML::Twig;
use Data::Dumper;

my $cat = load_xml_catalog();
say Dumper $cat;

sub load_xml_catalog {
    my $hr;
    my $current;
    my $twig= XML::Twig->new(
        twig_roots => {
            ADN => sub { # process the <ADN> elements
                $_->purge; # and purge when finishes with one
            },
        },
        twig_handlers => {
            'ADN/GID' => sub {
                return 1 if $_->trimmed_text == 1;
                return 0; # skip the other handlers - if the GID != 1
            },

            'ADN/ID' => sub { #remember the ID as a "key" into the '$hr' for the "current" ADN
                $current = $_->trimmed_text;
                $hr->{$current}{$_->tag} = $_->trimmed_text;
            },

            #rules for the wanted data extracting & storing to $hr->{$current}
            'ADN/Name' => sub {
                $hr->{$current}{$_->tag} = $_->text;
            },
        },
    );
    $twig->parse(\*DATA);
    return $hr;
}
__DATA__
<ArrayOfADN>
<ADN>
<GID>1</GID>
<ID>1</ID>
<Name>name 1</Name>
</ADN>
<ADN>
<GID>2</GID>
<ID>20</ID>
<Name>should be skipped because GID != 1</Name>
</ADN>
<ADN>
<GID>1</GID>
<ID>1000</ID>
<Name>other name 1000</Name>
</ADN>
</ArrayOfADN>


It outputs

$VAR1 = {
'1000' => {
'ID' => '1000',
'Name' => 'other name 1000'
},
'1' => {
'Name' => 'name 1',
'ID' => '1'
},
'20' => {
'Name' => 'should be skipped because GID != 1',
'ID' => '20'
}
};


So,

  • The handler for ADN/GID returns 0 when the GID != 1.

  • Why are the other handlers still called?

  • The expected (wanted) output is without the '20' => ... entry.

  • How do I skip the unwanted nodes correctly?


Answer

The "returns zero" rule is a bit of a red herring in this context. It only matters when several handlers match the same element: if one of them returns zero, the remaining handlers for that element are skipped.

It doesn't stop the parser from processing subsequent elements.

I think that's where the confusion comes from: you have handlers for separate subelements of your <ADN> elements, and they trigger separately. That's by design. There is a precedence order for XPath handlers, but it only applies to multiple matches on the same element. Yours match different elements (GID, ID, Name), so they all fire.
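To see the difference for yourself, here is a small sketch (my own illustration, not from the docs) that logs which handlers fire. The @log array and the inline XML string are just assumptions for the demo. Going by the ordering rules quoted in the question, I would expect the more specific ADN/GID handler to run first and suppress the bare GID handler, while ADN/ID fires regardless because it matches a different element. The log will show what your XML::Twig version actually does.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my @log;    # records which handlers fired, in order (demo only)

my $twig = XML::Twig->new(
    twig_handlers => {
        # Two handlers that can match the same <GID> element.
        'ADN/GID' => sub { push @log, 'ADN/GID saw ' . $_->text; return 0; },
        'GID'     => sub { push @log, 'GID saw ' . $_->text;     return 1; },

        # A handler for a different element; the 0 returned above
        # has no effect on it.
        'ADN/ID' => sub { push @log, 'ADN/ID saw ' . $_->text; return 1; },
    },
);

$twig->parse('<ArrayOfADN><ADN><GID>2</GID><ID>20</ID></ADN></ArrayOfADN>');
print "$_\n" for @log;
```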

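However, you might find it useful to know that XML::Twig supports XPath-like expressions, both in findnodes and as twig_handlers keys, so you can explicitly say (using the same __DATA__ section as in your question):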

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;
my $twig = XML::Twig->parse( \*DATA );
$twig -> set_pretty_print('indented_a');

foreach my $ADN ( $twig -> findnodes('//ADN/GID[string()="1"]/..') ) {
   $ADN -> print;
}

The same kind of expression also works as a twig_handlers key. I would suggest that a handler is only really useful if you need to pre-process your XML, or you're memory constrained. With 20,000 nodes, you may be (at which point purge is your friend).

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;
my $twig = XML::Twig->new(
   pretty_print  => 'indented_a',
   twig_handlers => {
      '//ADN[string(GID)="1"]' => sub { $_->print }
   }
);

$twig->parse( \*DATA );


__DATA__
<ArrayOfADN>
    <ADN>
        <GID>1</GID>
        <ID>1</ID>
        <Name>name 1</Name>
    </ADN>
    <ADN>
        <GID>2</GID>
        <ID>20</ID>
        <Name>should be skipped because GID != 1</Name>
    </ADN>
    <ADN>
        <GID>1</GID>
        <ID>1000</ID>
        <Name>other name 1000</Name>
    </ADN>
</ArrayOfADN>

Although, I would probably just do it this way instead:

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;

sub process_ADN {
    my ( $twig, $ADN ) = @_; 
    return unless $ADN -> first_child_text('GID') == 1;
    print "ADN with name:", $ADN -> first_child_text('Name')," Found\n";
}


my $twig = XML::Twig->new(
   pretty_print  => 'indented_a',
   twig_handlers => {
      'ADN' => \&process_ADN
   }
);

$twig->parse( \*DATA );


__DATA__
<ArrayOfADN>
    <ADN>
        <GID>1</GID>
        <ID>1</ID>
        <Name>name 1</Name>
    </ADN>
    <ADN>
        <GID>2</GID>
        <ID>20</ID>
        <Name>should be skipped because GID != 1</Name>
    </ADN>
    <ADN>
        <GID>1</GID>
        <ID>1000</ID>
        <Name>other name 1000</Name>
    </ADN>
</ArrayOfADN>
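
If you want the hash from your question rather than printed output, the same idea can be sketched as below. This is a minimal sketch, not a drop-in replacement: the %catalog name and the inline $xml string are my own choices for illustration, and I compare the trimmed GID as a string (eq '1'), which assumes the value is written exactly as 1. Purging after each <ADN> keeps memory flat on a 20k-element file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;

my %catalog;    # ID => { ID => ..., Name => ... } for wanted ADNs only

my $xml = <<'XML';
<ArrayOfADN>
    <ADN><GID>1</GID><ID>1</ID><Name>name 1</Name></ADN>
    <ADN><GID>2</GID><ID>20</ID><Name>should be skipped</Name></ADN>
    <ADN><GID>1</GID><ID>1000</ID><Name>other name 1000</Name></ADN>
</ArrayOfADN>
XML

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once per <ADN>, after all of its children are parsed.
        ADN => sub {
            my ( $t, $adn ) = @_;

            # Assumption: GID is written exactly as "1" (string compare).
            if ( $adn->first_child_trimmed_text('GID') eq '1' ) {
                my $id = $adn->first_child_trimmed_text('ID');
                $catalog{$id} = {
                    ID   => $id,
                    Name => $adn->first_child_text('Name'),
                };
            }

            $t->purge;    # free everything parsed so far
            return 1;
        },
    },
);

$twig->parse($xml);
print Dumper \%catalog;
```

Because the decision is made once the whole <ADN> is available, there is no need to track a "current" ID across separate handlers.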