capfan - 4 months ago
Perl Question

Keep encoded tag in XML::Twig



I would like to modify a large XML file using XML::Twig.

When using handler callbacks, XML::Twig seems to change characters that are encoded as HTML entities, such as the greater-than sign (&gt; becomes >).

Example script:

use strict;
use warnings;

use XML::Twig;

my $input = q~
<root>
<p>&lt;encoded tag&gt;</p>
</root>
~;

my $t = XML::Twig->new(
    keep_spaces              => 1,
    twig_roots               => { 'p' => \&convert },   # process p elements
    twig_print_outside_roots => 1,                      # print the rest
);

$t->parse($input);

sub convert {
    my ($t, $p) = @_;

    $p->set_att('x' => 'y');    # add an attribute to each p element

    $p->print;
}


This will turn the document into the following:

<root>
<p x="y">&lt;encoded tag></p>
</root>


I was expecting to get this:

<root>
<p x="y">&lt;encoded tag&gt;</p>
</root>


How do I keep the encoded contents of tags using XML::Twig?

Answer

You need to either set the keep_encoding option in the constructor, as below, or call $twig->set_keep_encoding($option) to change it after the object has been constructed.

Note that the module documentation says this about it:

This is a (slightly?) evil option: if the XML document is not UTF-8 encoded and you want to keep it that way, then setting keep_encoding will use the "Expat" original_string method for character, thus keeping the original encoding, as well as the original entities in the strings.

But here it is, doing as you asked. Whether the risk is acceptable is your own call.

use strict;
use warnings 'all';

use XML::Twig;

my $input = <<END_XML;
<root>
    <p>&lt;encoded tag&gt;</p>
</root>
END_XML

my $t = XML::Twig->new(
    keep_spaces              => 1,
    keep_encoding            => 1,
    twig_roots               => { p => \&convert },   # process p elements
    twig_print_outside_roots => 1,                    # print the rest
);

$t->parse($input);


sub convert {
    my ($t, $p) = @_;
    $p->print;
}

output

<root>
    <p>&lt;encoded tag&gt;</p>
</root>
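
It may also be worth checking whether the unescaped > is a problem at all. In XML character data only < and & have to be escaped; a bare > is allowed (except in the sequence ]]>), so a conforming parser should read both versions of the document as the same text. Here is a minimal sketch to confirm that, assuming XML::Twig is installed. The helper name p_text is made up for illustration and is not part of the module.

```perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical helper: return the decoded text of the first <p> child
# of the root element in the XML string $xml.
sub p_text {
    my ($xml) = @_;

    my $twig = XML::Twig->new;    # default options, keep_encoding not set
    $twig->parse($xml);

    return $twig->root->first_child('p')->text;
}

# The document as originally written, with both brackets escaped ...
my $escaped = p_text('<root><p>&lt;encoded tag&gt;</p></root>');

# ... and as XML::Twig printed it without keep_encoding.
my $unescaped = p_text('<root><p>&lt;encoded tag></p></root>');

print $escaped eq $unescaped ? "same text\n" : "different text\n";
```

If the consumer of the file is a real XML parser, this suggests keep_encoding may not be needed. It only matters when something downstream compares or processes the raw bytes.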