Ian - 1 year ago
Perl Question

Improper UTF-8 and LibXML::Reader

I have a large XML file from a remote source. Its XML declaration says it is UTF-8, but the file command reports us-ascii.

<?xml version="1.0" encoding="utf-8"?>...

file -bi <file> indicates application/xml; charset=us-ascii
Encode::Guess indicates UTF8

Edit: There is also some code which reads in the file, which was originally the output of an LWP get... I have also tried to force an encoding here, but then I get other errors such as "Wide character".

use IO::File;

my $fh = IO::File->new;
$fh->open( $filename, '<' ) or die qq(cannot open $filename: $!);
my $content = join '', <$fh>;    # slurp the whole file into one string

I am using XML::LibXML::Reader

my $reader = XML::LibXML::Reader->new( string => $content )
    or die qq(cannot read content: $!);

while ( $reader->nextElement( $template->{'item'} ) ) {
    my $copy = $reader->copyCurrentNode(1);    # deep copy of the current element
    my $test = $copy->findvalue('description');
    ... # do other stuff with $copy
}

This works fine through most of the content. However, there appears to be some invalid UTF-8 or malformed data, as the parser gives an error halfway through.

(Note: XML::Bare processes the whole XML 'fine' as it's more forgiving, but the file is at the limit of memory size, so I need an XML parser with a smaller memory footprint.)

Entity: line 64070: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x1A 0x73 0x20 0x73

If I look in vim at the point after last success, I can see

^Z or <^Z> 26, Hex 1a, Octal 032 with :ascii in vim

I have looked here on SO for a way to at least ensure valid UTF-8, as I can't get the origin fixed, and I am trying...

use Encode qw( encode decode );
my $chars = decode('UTF-8', $content, Encode::FB_DEFAULT );    # bytes -> characters; malformed bytes become U+FFFD
$content = encode('UTF-8', $chars, Encode::FB_CROAK );         # characters -> UTF-8 bytes

But I still get the same error. I am happy to skip any parts with invalid UTF-8, but the whole parser dies, and I can't see any way to carry on processing later (which I believe is supposed to happen with XML parsing).

My question is: is this the best way to guarantee valid UTF-8 (assuming I can't get the file changed), or is there a method that gets around the error? I could probably strip that particular character with a regex, but I assume there may be similar issues later, so that feels clunky.

Answer

The error message is misleading; the problem has nothing to do with encoding[1]. In fact, the error I receive is the following[2]:

:1: parser error : PCDATA invalid Char value 26

From the XML spec,

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

U+001A may not legally appear in an XML 1.0 file, not even as a character reference (&#x1A;).
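If the origin really cannot be fixed, one workaround is to drop every character that falls outside the Char production before handing the data to the parser. The following is only a minimal sketch: strip_invalid_xml_chars is a hypothetical helper name, the sample string is made up, and the approach silently alters the data, so it is a last resort rather than a proper fix.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: remove every character outside the XML 1.0 Char production.
# Expects a decoded (character) string, not raw bytes.
sub strip_invalid_xml_chars {
    my ($text) = @_;
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $text;
}

my $bytes = "<description>it\x1As so</description>";    # made-up sample containing 0x1A
my $chars = decode( 'UTF-8', $bytes );                  # bytes -> characters
my $clean = encode( 'UTF-8', strip_invalid_xml_chars($chars) );    # back to bytes for the parser

print "$clean\n";    # prints <description>its so</description>
```

Because the character class mirrors the whole production rather than just U+001A, it also covers any other illegal control characters that turn up later in the file.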

If the file is to contain binary data, the binary portions should be encoded (e.g. using base64).
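On the producing side, that could look roughly like the sketch below, which uses the core MIME::Base64 module. The payload element name and its encoding attribute are assumptions made for illustration; nothing in the question says what the feed's schema looks like.

```perl
use strict;
use warnings;
use MIME::Base64 qw( encode_base64 decode_base64 );

my $binary  = "raw\x1Abytes\x00here";          # made-up payload containing control bytes
my $encoded = encode_base64( $binary, '' );    # '' as the line ending keeps the output on one line

# Hypothetical element name; the base64 text contains only XML-safe ASCII.
my $xml = qq(<payload encoding="base64">$encoded</payload>);
print "$xml\n";

# The consumer reverses the step after parsing.
my $decoded = decode_base64($encoded);
```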

  1. The bytes 0x1A, 0x73 and 0x20 are all below 0x80, so they are plain ASCII and therefore well-formed UTF-8.

  2. I tested using XML::LibXML rather than XML::LibXML::Reader, but I suspect the relevant difference is actually a difference in the version of XML::LibXML or libxml2.
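
To check the point of footnote 1, and to find out whether other illegal characters are lurking further on, something like the following sketch may help. It decodes strictly, which would die on malformed UTF-8, and then reports every character outside the Char production. The sample data is made up to contain the bytes from the error message; for the real file you would use the slurped content instead.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $bytes = "line one\nit\x1As so\n";    # made-up sample; contains 0x1A 0x73 0x20 0x73

# Strict decode: FB_CROAK dies on malformed UTF-8. It does not die here,
# because every byte is below 0x80.
my $chars = decode( 'UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );

# Collect each character that the XML 1.0 Char production does not allow.
my @bad;
my $line_no = 0;
for my $line ( split /\n/, $chars ) {
    $line_no++;
    while ( $line =~ /([^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/g ) {
        # pos() is the offset just past the match, i.e. the 1-based column
        push @bad, sprintf( 'line %d, column %d: U+%04X', $line_no, pos($line), ord($1) );
    }
}

print "$_\n" for @bad;    # prints: line 2, column 3: U+001A
```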