bryansis2010 bryansis2010 - 1 year ago 101
Python Question

xml.etree.ElementTree for chinese

Given an XML with chinese characters, I would like to use

to help me parse the XML to do some processing. The English version works. For example:

>el.xml printf '%s\n' $'<?xml version=\'1.0\' encoding=\'utf8\'?><Color>Grey</Color>'
>cl.xml printf '%s\n' $'<?xml version=\'1.0\' encoding=\'utf8\'?><Color>灰色</Color>'

tryParse() {
python -c 'import xml.etree.ElementTree as ET; import sys; ET.parse(sys.argv[1])' "[email protected]"

tryParse el.xml && printf '%s\n\n' "English works"
tryParse cl.xml && printf '%s\n\n' "Chinese works"

...emits as output:

English works

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/", line 1182, in parse
tree.parse(source, parser)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/", line 656, in parse
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/", line 1642, in feed
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 44

Answer Source

Use lxml instead:

>>> import lxml.etree as ET
>>> doc = ET.parse('cl.xml')
>>> print doc.getroot().text
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download