Schemer Schemer - 1 year ago 87
Python Question

Python: null character breaking xml format

I have some python code that processes input files and dumps certain fields from the input to XML files. This code broke when passing a null character from input -- throwing an invalid token error:

def pretty_print_xml(elem):

rough_string = ET.tostring(elem, 'utf-8')
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=' ')

This surprised me and I would like to know why it broke and what else might need to be sanitized from the input. I thought only an XML meta character could throw this error and these are already being handled by minidom.

Answer Source

NUL literals are not allowed in XML. See the XML standard, version 1.1:

2.2 Characters

[Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors must accept any character in the range specified for Char.]

[2]       Char       ::=      [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a]      RestrictedChar     ::=      [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

Note that Char is defined to allow (among other ranges) \x01 through \xD7FF -- but not \x00.

By the way -- if your goal is pretty-printing, I'd suggest using lxml.etree. If the pretty_print=True argument on serialization calls doesn't work out-of-the-box, see the relevant FAQ entry.