milesb - 3 months ago
Perl Question

Why is Perl XML::LibXML changing UTF8 to 8859-1?

With this input file

<?xml version="1.0" encoding="UTF-8"?>
<entry>
<title>ú</title>
</entry>


and this code,

my $raw_xml = read_file("test.xml", binmode => 'raw');
print "$raw_xml\n";
$raw_xml =~ /<title>(.*?)</;
print "Regex finds [$1]\n"; # prints u+accent to UTF8 terminal

my $dom = XML::LibXML->load_xml(string => $raw_xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
my ($entry) = $xpc->findnodes('entry');
my $title = $xpc->findvalue('title', $entry) || '';

print "title is now [$title]\n"; # prints garbage character to UTF8 terminal, u+accent to ISO-8859-1 terminal


Where/why is perfectly good utf8 being translated into one of the 8 bit character sets (I'm assuming it's 8859-1, could be cp1252 etc)?

Everything I've found via Google suggests that it should all be utf8 from end to end. But clearly it's not.

Note: the behaviour is exactly the same if I open the file on a filehandle with binmode and pass it into load_xml; I happen to have the xml in memory in the real code this is distilled from - it also means I can verify with a regex as above.

Answer

You have two bugs which cancel out to produce the correct output in the first test.


Your home-grown parser doesn't decode the document

You can observe this bug by changing /<title>(.*?)</ to /<title>(.)</. Rather than capturing the first character (ú) as intended, it captures only the first byte of its UTF-8 encoding (C3).
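
A minimal sketch of that experiment, assuming a hard-coded byte string stands in for the contents of test.xml (the variable names here are illustrative, not from the question). It compares what the same regex captures before and after decoding:

```perl
use strict;
use warnings;
use Encode qw( decode_utf8 );

# Stand-in for a raw read: C3 BA are the two UTF-8 bytes that encode U+00FA.
my $raw     = "<title>\xC3\xBA</title>";
my $decoded = decode_utf8($raw);

# Capture exactly one "character" after the opening tag in each string.
my ($raw_first)     = $raw     =~ /<title>(.)/;
my ($decoded_first) = $decoded =~ /<title>(.)/;

printf "raw:     length %d, first char %02X\n", length($raw),     ord($raw_first);
printf "decoded: length %d, first char %02X\n", length($decoded), ord($decoded_first);
# prints:
# raw:     length 17, first char C3
# decoded: length 16, first char FA
```

In the undecoded string every byte counts as its own character, which is why (.) stops after C3.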

To fix this, replace

$raw_xml =~ /<title>(.*?)</;
print "Regex finds [$1]\n";

with

use Encode qw( decode_utf8 );

my $decoded_xml = decode_utf8($raw_xml);
$decoded_xml =~ /<title>(.*?)</;
print "Regex finds [$1]\n";

Now you get the same behaviour from both tests, namely the same garbage output. This brings us to the second problem.


You don't encode your outputs

XML::LibXML returns decoded text, aka Unicode code points. ú is therefore returned as character FA, since ú is U+00FA. This is proper, as you shouldn't have to care about encodings except when doing I/O.

The problem happens when doing I/O. On a handle with no encoding layer, print expects each character it receives to represent a byte, so when you tell it to print character FA, it prints byte FA, and your terminal goes "wtf?", because a lone FA byte is not valid UTF-8.

Your terminal expects UTF-8, so you either need to encode the string using UTF-8 before passing it to print, or tell print to do it for you.

# Decode STDIN (UTF-8).
# Encode STDOUT and STDERR (UTF-8).
# The default encoding for files opened in scope is UTF-8.
use open ':std', ':encoding(UTF-8)';
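
The other option, encoding the string yourself before handing it to print, could look roughly like this. It is only a sketch: the hard-coded $title stands in for the value XML::LibXML would return, and it assumes STDOUT has no encoding layer (so don't combine it with the pragma above, or the text gets encoded twice).

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

my $title = "\x{FA}";             # one character, U+00FA (stand-in for the parsed title)
my $bytes = encode_utf8($title);  # two bytes: C3 BA

printf "%d character(s), %d byte(s)\n", length($title), length($bytes);
print "title is now [$bytes]\n";  # a UTF-8 terminal now receives valid UTF-8
```

Encoding by hand works, but it has to be repeated at every print, which is why setting the encoding on the handles once is usually less error-prone.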

Complete solution:

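use strict;
use warnings;

use open ':std', ':encoding(UTF-8)';

use Encode      qw( decode_utf8 );
use File::Slurp qw( read_file );   # assumed source of read_file; the question doesn't show its imports
use XML::LibXML;

my $raw_xml = read_file("test.xml", binmode => 'raw');

# Home-grown parsing: decode the bytes into characters before matching.
{
   my $decoded_xml = decode_utf8($raw_xml);
   my ($title) = $decoded_xml =~ /<title>(.*?)</;
   printf("%s: [%s] [%s]\n", "Home-grown", $title, substr($title, 0, 1));
}

# XML::LibXML: parses the raw bytes and returns decoded text.
{
   my $doc = XML::LibXML->load_xml(string => $raw_xml );
   my ($entry_node) = $doc->findnodes('entry');
   my $title = $entry_node->findvalue('title');
   printf("%s: [%s] [%s]\n", "LibXML", $title, substr($title, 0, 1));
}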