Sarien Sarien - 27 days ago 6
Python Question

How do I parse HTML-like data that contains errors?

I have data that looks like part of an HTML document. However, it contains some errors, such as:

<td class= foo"bar">

All the parsers I tried (lxml, xml.etree) fail on this with an error.

Since I don't actually care about this specific part of the document, I am looking for a more robust parser.

Ideally, I want something that lets me ignore errors in specific subtrees (perhaps by simply not inserting the affected nodes), or something that only lazily parses the parts of the tree I am actually traversing.

Answer

You are using XML parsers. XML is a strict language, while the HTML standard requires parsers to be tolerant of errors.

Use an HTML parser instead, such as lxml.html or html5lib, or the wrapper library BeautifulSoup, which can use either of them as its backend behind a cleaner API. html5lib is slower, but it closely mimics how a modern browser would treat errors.
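
As a rough illustration of the difference between strict and tolerant parsing, here is a minimal sketch. To stay dependency-free it uses the standard library's html.parser as a stand-in for the tolerant side; it is not a full HTML5 implementation and is not one of the libraries recommended above. The sample markup, the CellText class, and the helper names xml_parses and cell_texts are made up for this example. With the third-party libraries, the tolerant step would typically be lxml.html.fromstring(data) or BeautifulSoup(data, "html5lib").

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample built around the malformed attribute from the question.
DATA = '<table><tr><td class= foo"bar">broken cell</td><td>good cell</td></tr></table>'


def xml_parses(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


class CellText(HTMLParser):
    """Collect the text of every <td>, ignoring whatever its attributes look like."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def cell_texts(markup):
    """Parse tolerantly and return the text content of each table cell."""
    parser = CellText()
    parser.feed(markup)
    parser.close()
    return parser.cells


if __name__ == "__main__":
    print("strict XML parser accepts it:", xml_parses(DATA))
    print("cells found by tolerant parser:", cell_texts(DATA))
```

The strict parser rejects the whole input because of the one bad attribute, while the tolerant parser keeps going and still yields both cells.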
