Prody - 22 days ago
Python Question

What's the best way to handle &nbsp;-like entities in XML documents with lxml?

Consider the following:

from io import BytesIO  # available on Python 2.6+ and 3

from lxml import etree

x = b"""<?xml version="1.0" encoding="utf-8"?>\n<aa>&nbsp;&acirc;</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(BytesIO(x), p)  # raises XMLSyntaxError


This would fail with:

lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 2, column 11


This is because resolve_entities=False doesn't ignore them; it just doesn't resolve them.
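
The distinction can be seen in a minimal sketch, assuming only that lxml is installed. The helper name try_parse is hypothetical. The five predefined XML entities parse without any DTD, while an undeclared entity such as &nbsp; fails whether or not resolve_entities is set.

```python
from lxml import etree


def try_parse(xml_bytes, resolve_entities):
    """Hypothetical helper: return the root's text, or the error message on failure."""
    parser = etree.XMLParser(resolve_entities=resolve_entities)
    try:
        return etree.fromstring(xml_bytes, parser).text
    except etree.XMLSyntaxError as exc:
        return "error: %s" % exc


# &amp; is one of the five predefined entities, so no DTD is needed.
print(try_parse(b"<aa>&amp;</aa>", resolve_entities=False))

# &nbsp; is not declared anywhere, so parsing fails with either setting.
print(try_parse(b"<aa>&nbsp;</aa>", resolve_entities=False))
print(try_parse(b"<aa>&nbsp;</aa>", resolve_entities=True))
```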

If I use etree.HTMLParser instead, it creates html and body tags, plus a lot of other special handling it tries to do for HTML.
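
The wrapping behaviour described above can be reproduced with a short sketch. It assumes the same fragment as in the question and nothing beyond lxml itself.

```python
from lxml import etree

# Parse the fragment with the HTML parser instead of the XML parser.
root = etree.fromstring("<aa>&nbsp;&acirc;</aa>", etree.HTMLParser())

# The returned root is not <aa>: the parser has added wrapper elements.
print(root.tag)
print([el.tag for el in root.iter()])

# The entities are resolved to characters, so the literal references are lost.
aa = root.find(".//aa")
print(repr(aa.text))
```

The HTML parser does know &nbsp; and &acirc;, but it replaces them with the corresponding characters and changes the tree shape, which is why it does not answer the question as asked.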

What's the best way to get a &nbsp;&acirc; text child under the aa tag with lxml?

Answer

You can't ignore entities, as they are part of the XML definition. A document that has no DTD, or that declares standalone="yes", is not well-formed if it references an entity that is not declared, apart from the five predefined ones (&amp;, &lt;, &gt;, &apos; and &quot;). The workaround is to lie and claim your document is HTML.

https://mailman-mail5.webfaction.com/pipermail/lxml/2008-February/003398.html

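You can try lying and putting an XHTML DTD on your document. Once the document declares an external DTD, which the parser does not load by default, an undeclared entity is no longer a well-formedness error, and resolve_entities=False leaves the reference in place. For example: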

from io import BytesIO  # available on Python 2.6+ and 3

from lxml import etree

# The DOCTYPE names an external XHTML DTD; the parser does not load it by default.
x = b"""<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<aa>&nbsp;&acirc;</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(BytesIO(x), p)
# Serialize the root element only, so the DOCTYPE is not part of the output.
etree.tostring(r.getroot())  # '<aa>&nbsp;&acirc;</aa>' (bytes on Python 3)
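
If the set of entities is small and known in advance, another option is to declare them yourself in an internal DTD subset instead of pointing at an external DTD. This is a sketch of that alternative, not part of the original answer. It assumes only &nbsp; and &acirc; occur, and the replacement values are the standard character references for those two names.

```python
from lxml import etree

# Assumption: only these two entities occur; extend the list as needed.
doc = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE aa [
  <!ENTITY nbsp "&#160;">
  <!ENTITY acirc "&#226;">
]>
<aa>&nbsp;&acirc;</aa>"""

# Default parser settings: declared entities are replaced by their characters.
resolved = etree.fromstring(doc)
print(repr(resolved.text))

# resolve_entities=False keeps the references as entity nodes instead.
kept = etree.fromstring(doc, etree.XMLParser(resolve_entities=False))
print(etree.tostring(kept))
```

With the declarations in place the document is well-formed on its own terms, so nothing depends on a DOCTYPE that does not match the content.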