Justin Smith - 6 months ago
Python Question

lxml.etree.XMLSyntaxError: htmlParseEntityRef: expecting ';'

I'm trying to figure out the Python lxml API, but I'm running into a peculiar problem. I've installed the following library versions:


  • libxml2 : 2.7.8

  • libxslt : 1.1.26



When I run the following code:

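from lxml import etree

# iterparse reads from a filename or a binary file object, so no StringIO wrapper is needed
with open('file.html', 'rb') as html:
    context = etree.iterparse(html, events=("start", "end"), html=True)
    for event, element in context:
        pass  # do stuff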


EDIT:

It turns out that it is a parsing error. I moved the HTML to a file (shown below):

<html>
<head></head>
<body>
<table>
<tr>
<td>image</td>
<a href="relative.phtml?with=querystring&blah=blah">blah\n(blah)</a></td>
<td> 35 </td>
<td> 28 </td>
<td><b>-7</b></td>
<td>
23,000 </td>
<td> 373,000 </td>
<td> 644,000 </td>
<td>+72.65%</td>
</tr>
<tr>
<td>image</td>
<td><a href="relative.phtml?with=querystring&blah=blah">blah\n(blah)</a></td>
<td> 35 </td>
<td> 28 </td>
<td><b>-7</b></td>
<td>
23,000 </td>
<td> 373,000 </td>
<td> 644,000 </td>
<td>+72.65%</td>
</tr>
</table>
</body>
</html>


I'm now getting this error:


for event, element in context:

File "iterparse.pxi", line 515, in lxml.etree.iterparse.next
(src/lxml/lxml.etree.c:86484) File "parser.pxi", line 565, in
lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
lxml.etree.XMLSyntaxError: error parsing attribute name, line 1,
column 12


ORIGINAL ERROR:


for event, element in context:

File "iterparse.pxi", line 515, in lxml.etree.iterparse.next
(src/lxml/lxml.etree.c:86484) File "parser.pxi", line 565, in
lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
lxml.etree.XMLSyntaxError: htmlParseEntityRef: expecting ';', line 7,
column 71


I thought I followed the tutorial from lxml's site pretty closely here so I'm very confused. Could it be an installation problem?

Answer

The problem is that the HTML is malformed. To solve this, you can use BeautifulSoup (it's able to parse this HTML) or sanitize the HTML before trying to parse it.

The problems I've found are:

  • Ampersand should be escaped as an HTML entity in links: & => &amp;
  • The closing td tag after the first a tag has to be removed, since it doesn't match any opening td tag.
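
As a rough sketch of the sanitizing route, assuming the two problems listed above are the only ones in the file: the helpers below use only the Python standard library to escape bare ampersands and to report closing tags that have no matching opening tag. The names escape_bare_ampersands and UnmatchedCloseFinder are made up for this illustration and are not part of lxml or BeautifulSoup. The regex also assumes that anything shaped like &name; is an intended entity, which may not hold for every document.

```python
import re
from html.parser import HTMLParser

# Matches '&' that does not already start an entity such as &amp;, &#38; or &#x26;.
# Assumption: any '&name;' sequence is an intended entity and is left alone.
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def escape_bare_ampersands(markup):
    """Replace every '&' that is not part of an entity with '&amp;'."""
    return _BARE_AMP.sub("&amp;", markup)


class UnmatchedCloseFinder(HTMLParser):
    """Collect closing tags that have no matching opening tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tag names that are currently open
        self.unmatched = []   # (tag, line number) for each stray closing tag

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the matching opening tag.
            while self.open_tags.pop() != tag:
                pass
        else:
            self.unmatched.append((tag, self.getpos()[0]))
```

To try it, pass the file's contents through escape_bare_ampersands, then feed the result to an UnmatchedCloseFinder instance and inspect its unmatched list to see which closing tags to remove by hand before handing the markup to lxml.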