MERose - 8 months ago
Python Question

How to read an XML file containing an & sign

This is my XML file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE papers>
<title>Title containing & and more</title>

How do I read that using lxml? I tried:

from lxml import etree

with open(xml_file, 'r') as inf:
    tree = etree.parse(inf)

but it results in the following Traceback:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
  File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
  File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
  File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
  File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92461)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91757)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 5, column 30


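A bare & is not well-formed XML (it should be written as &amp;), which is why the XML parser rejects the file. If you need to retain the & character, you can parse the file with lxml's HTML parser, which tolerates an unescaped &: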

from lxml import html
tree = html.parse(path)

If you don't need the & character, you can create a new XML parser with the recover=True option, which tries to parse past the malformed & instead of raising an error:

from lxml import etree
parser = etree.XMLParser(recover=True)
tree = etree.parse(path, parser=parser)
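
A third option, not part of the answer above, is to repair the input before parsing so that the & survives and the document is still handled as XML. The following is only a minimal sketch under some assumptions: the names BARE_AMP, escape_bare_ampersands and parse_lenient are hypothetical, the file is assumed small enough to read into memory, and the regular expression does not know about CDATA sections or comments, so a literal & inside those would be escaped as well. The fallback import is there only so the sketch runs without lxml installed.

```python
import re

try:
    from lxml import etree  # parser used in the question
except ImportError:
    # Assumption: fall back to the standard library, which also offers fromstring()
    import xml.etree.ElementTree as etree

# Match "&" unless it already starts an entity or character reference
# such as &amp;, &#38; or &#x26;
BARE_AMP = re.compile(
    rb"&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(data: bytes) -> bytes:
    """Replace every stray '&' in raw XML bytes with '&amp;'."""
    return BARE_AMP.sub(b"&amp;", data)


def parse_lenient(path):
    """Hypothetical helper: read the file, repair bare ampersands, then parse."""
    # Read bytes so the encoding declaration in the XML prolog stays valid
    with open(path, "rb") as inf:
        return etree.fromstring(escape_bare_ampersands(inf.read()))
```

Calling root = parse_lenient(xml_file) on the file from the question should give a title element whose text still contains the & character. If the files come from a source you control, fixing the writer so that it emits &amp; is the cleaner long-term solution.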