Aillyn - 1 year ago
Python Question

Why is ElementTree raising a ParseError?

I have been trying to parse a file with the following code:


import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError

def analyze(xml):
    # Stream the file element by element instead of loading it whole.
    it = ET.iterparse(open(xml, 'rb'))
    count = 0
    last = None

    try:
        for (ev, el) in it:
            count += 1
            last = el

    except ParseError:
        # Report the last element that was parsed before the failure.
        print("catastrophic failure")
        print("last successful: {0}".format(last))

    print('count: {0}'.format(count))

This is of course a simplified version of my code, but it is enough to break my program. With some files, I get this error if I remove the try/except block:

Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    from yparse import analyze; analyze('file.xml')
  File "C:\Python27\", line 10, in analyze
    for (ev, el) in it:
  File "C:\Python27\lib\xml\etree\", line 1258, in next
  File "C:\Python27\lib\xml\etree\", line 1624, in feed
  File "C:\Python27\lib\xml\etree\", line 1488, in _raiseerror
    raise err
ParseError: reference to invalid character number: line 1, column 52459

The results are deterministic, though: if a file works, it always works. If a file fails, it always fails, and always at the same point.

The strangest thing is that when I use the trace to look for malformed XML that might be breaking the parser, and isolate the node that caused the failure, an XML file containing just that node and a few of its neighbors parses fine!

This doesn't seem to be a size problem either. I have managed to parse much larger files with no problems.

Any ideas?

Answer

As @John Machin suggested, the files in question do have dubious numeric entities in them, though the error messages seem to be pointing at the wrong place in the text. Perhaps the streaming nature and buffering are making it difficult to report accurate positions.

In fact, all of these entities appear in the text:

set(['&#x08;', '&#x0E;', '&#x1E;', '&#x1C;', '&#x18;', '&#x04;', '&#x0A;', '&#x0C;', '&#x16;', '&#x14;', '&#x06;', '&#x00;', '&#x10;', '&#x02;', '&#x0D;', '&#x1D;', '&#x0F;', '&#x09;', '&#x1B;', '&#x05;', '&#x15;', '&#x01;', '&#x03;'])
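A set like the one above can be collected with a regular expression. The following is only a sketch of one way to do it, not necessarily how the list was produced; the names `CHAR_REF`, `is_xml10_char`, and `invalid_char_refs` are made up for illustration. The validity check follows the `Char` production of XML 1.0.

```python
import re

# Matches decimal (&#8;) and hexadecimal (&#x08;) character references.
CHAR_REF = re.compile(r'&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));')

def is_xml10_char(cp):
    """Return True if code point cp is a legal XML 1.0 character."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def invalid_char_refs(text):
    """Return the set of character references in text that XML 1.0 forbids."""
    bad = set()
    for m in CHAR_REF.finditer(text):
        hex_digits, dec_digits = m.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not is_xml10_char(cp):
            bad.add(m.group(0))
    return bad
```

Calling `invalid_char_refs(open('file.xml').read())` on a failing file should list the offending references. Note that the pattern also matches text inside CDATA sections, where `&#x08;` is literal text rather than a reference.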

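Most are not allowed: among characters below &#x20;, XML 1.0 permits only &#x09;, &#x0A;, and &#x0D;, so a reference to any of the others makes the document ill-formed. This parser enforces that rule strictly, so you'll need to either find a more lenient parser or pre-process the XML to remove the invalid references.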
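As a rough sketch of the pre-processing route, the invalid references can be removed before the data reaches ElementTree. The helper names here (`strip_invalid_char_refs`, `count_elements`) are hypothetical, and the approach assumes the file fits in memory, uses an ASCII-compatible encoding such as UTF-8, and has no CDATA sections whose literal text looks like a character reference.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches decimal and hexadecimal character references in raw bytes.
CHAR_REF = re.compile(br'&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));')

def _is_xml10_char(cp):
    """Return True if code point cp is a legal XML 1.0 character."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def strip_invalid_char_refs(data):
    """Drop character references that XML 1.0 forbids; keep the rest."""
    def repl(m):
        hex_digits, dec_digits = m.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return m.group(0) if _is_xml10_char(cp) else b''
    return CHAR_REF.sub(repl, data)

def count_elements(data):
    """Parse cleaned XML bytes with iterparse and count 'end' events."""
    cleaned = strip_invalid_char_refs(data)
    count = 0
    for _event, _elem in ET.iterparse(io.BytesIO(cleaned)):
        count += 1
    return count
```

For a file on disk, something like `count_elements(open('file.xml', 'rb').read())` would apply it. Dropping the references silently changes the text content, so replacing them with a placeholder such as `?` may be preferable if the data matters.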