dufferZafar - 3 months ago
Python Question

How to prevent lxml from adding a default doctype

lxml seems to add a default doctype when one is missing from the HTML document.

See this demo code:

import lxml.etree
import lxml.html


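def beautify(html):
    parser = lxml.etree.HTMLParser(
        strip_cdata=True,
        remove_blank_text=True
    )

    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo

    # tostring() returns bytes when given a byte encoding, so decode
    # the result to keep the string assertions below valid on Python 3.
    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=docinfo.doctype,
        encoding='utf8'
    ).decode('utf8')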


with_doctype = """
<!DOCTYPE html>
<html>
<head>
<title>With Doctype</title>
</head>
</html>
"""

# This passes!
assert "DOCTYPE" in beautify(with_doctype)

no_doctype = """<html>
<head>
<title>No Doctype</title>
</head>
</html>"""

# This fails!
assert "DOCTYPE" not in beautify(no_doctype)

# because the returned html contains this line
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# which was not present in the source before





How can I tell lxml to not do this?

This issue was originally raised here:
https://github.com/mitmproxy/mitmproxy/issues/845

Quoting a comment on reddit as it might be helpful:


lxml is based on libxml2, which does this by default unless you pass the option HTML_PARSE_NODEFDTD, I believe. Code here.

I don't know if you can tell lxml to pass that option, though. libxml has Python bindings that you could perhaps use directly, but they seem really hairy.

EDIT: did some more digging and that option does appear in the lxml source here. That option does exactly what you want, but I'm not sure how to activate it yet, if it's even possible.

Answer

There is currently no way to do this in lxml, but I've created a pull request on lxml that adds a default_doctype boolean option to HTMLParser.
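
Until that option is available, one possible workaround is to check the source for a doctype yourself and only hand one to the serializer when the source actually declared it. The sketch below is an assumption of mine rather than anything lxml provides: has_doctype is a hypothetical helper, it only handles str input, and it only looks at the start of the document (after whitespace and comments).

```python
import re

# Hypothetical helper, not part of lxml: matches a doctype declaration
# at the start of the markup, allowing leading whitespace and comments.
_DOCTYPE_RE = re.compile(
    r'\s*(?:<!--.*?-->\s*)*<!DOCTYPE\b',
    re.IGNORECASE | re.DOTALL,
)


def has_doctype(html):
    """Return True if the markup (a str) starts with a doctype declaration."""
    return _DOCTYPE_RE.match(html) is not None
```

In beautify, the doctype argument could then become docinfo.doctype if has_doctype(html) else None. As far as I can tell, tostring emits no doctype for an element when that argument is None, but treat that as something to verify against your lxml version.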

Once the code gets merged in, the parser needs to be created like so:

parser = lxml.etree.HTMLParser(
    strip_cdata=True,
    remove_blank_text=True,
    default_doctype=False,
)

Everything else stays the same.
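
For completeness, here is roughly what the whole demo might look like with that option, assuming an lxml build that accepts default_doctype. Two details are my own choices rather than part of the pull request: strip_cdata is left at its default, and an empty doctype is turned into None so that nothing extra is written before the root element.

```python
import lxml.etree
import lxml.html


def beautify(html):
    # default_doctype=False asks the parser not to invent a doctype
    # when the source has none (assumes lxml supports this option).
    parser = lxml.etree.HTMLParser(
        remove_blank_text=True,
        default_doctype=False,
    )

    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo

    return lxml.etree.tostring(
        d,
        pretty_print=True,
        # Pass None instead of an empty string when no doctype was parsed.
        doctype=docinfo.doctype or None,
        encoding='unicode',
    )


with_doctype = (
    "<!DOCTYPE html>\n"
    "<html><head><title>With Doctype</title></head></html>"
)
no_doctype = "<html><head><title>No Doctype</title></head></html>"

# The doctype is kept when the source declares one...
assert "DOCTYPE" in beautify(with_doctype)
# ...and is no longer added when the source has none.
assert "DOCTYPE" not in beautify(no_doctype)
```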