lxml seems to add a default doctype when the HTML document does not have one.
See this demo code:
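import lxml.etree
import lxml.html

def beautify(html):
    parser = lxml.etree.HTMLParser(strip_cdata=True, remove_blank_text=True)
    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo
    # Re-emit whatever doctype lxml reports for the parsed document
    return lxml.etree.tostring(d, pretty_print=True, doctype=docinfo.doctype, encoding="unicode")

# Minimal sample document that declares a doctype
with_doctype = """<!DOCTYPE html>
<html><body><p>Hello</p></body></html>"""
# This passes!
assert "DOCTYPE" in beautify(with_doctype)

# Minimal sample document without a doctype
no_doctype = """<html>
<body><p>Hello</p></body></html>"""
# This fails!
assert "DOCTYPE" not in beautify(no_doctype)
# because the returned HTML contains this line,
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# which was not present in the source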
lxml is based on libxml2, which I believe adds the default doctype unless you pass the parser option that suppresses it. Code here.
I don't know whether you can tell lxml to pass that option, though. libxml2 has Python bindings that you could perhaps use directly, but they seem really hairy.
EDIT: I did some more digging, and that option does appear in the lxml source here. It does exactly what you want, but I'm not sure how to activate it yet, if that's even possible.
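Until the option can be reached from lxml, one possible workaround is to post-process the serialized output: if the source had no doctype, drop the one that was added. The sketch below uses only the standard library. The helper names has_doctype and drop_added_doctype are made up for illustration, and it assumes the doctype, when present, is the first thing in the markup (no leading comment or XML declaration).

```python
import re

# Matches a doctype declaration at the very start of the markup
# (assumption: nothing but whitespace precedes it).
_DOCTYPE_RE = re.compile(r"\s*<!DOCTYPE[^>]*>\s*", re.IGNORECASE)


def has_doctype(html: str) -> bool:
    """Return True if the markup starts with a doctype declaration."""
    return _DOCTYPE_RE.match(html) is not None


def drop_added_doctype(source: str, serialized: str) -> str:
    """Strip a leading doctype from `serialized` if `source` had none."""
    if has_doctype(source):
        # The doctype was really in the input, so keep it.
        return serialized
    match = _DOCTYPE_RE.match(serialized)
    return serialized[match.end():] if match else serialized
```

With the demo above, this would be applied as drop_added_doctype(no_doctype, beautify(no_doctype)).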
There is currently no way to do this in lxml, but I've created a pull request on lxml which adds a default_doctype boolean to the HTMLParser.
Once the code gets merged in, the parser needs to be created like so:
parser = lxml.etree.HTMLParser(
    strip_cdata=True,
    remove_blank_text=True,
    default_doctype=False,
)
Everything else stays the same.
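Put together, the demo might then look like the sketch below. It assumes an lxml build whose HTMLParser accepts default_doctype. The `docinfo.doctype or None` guard is an addition of this sketch rather than part of the pull request: it skips the doctype line entirely when lxml reports an empty doctype.

```python
import lxml.etree
import lxml.html


def beautify(html: str) -> str:
    # default_doctype=False asks the parser not to invent a doctype
    # when the source document does not declare one.
    parser = lxml.etree.HTMLParser(
        strip_cdata=True,
        remove_blank_text=True,
        default_doctype=False,
    )
    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo
    # An empty doctype string becomes None, so no doctype line is written
    # (this guard is an assumption of the sketch).
    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=docinfo.doctype or None,
        encoding="unicode",
    )


# Minimal sample documents, with and without a doctype
with_doctype = "<!DOCTYPE html>\n<html><body><p>Hello</p></body></html>"
no_doctype = "<html><body><p>Hello</p></body></html>"

assert "DOCTYPE" in beautify(with_doctype)
assert "DOCTYPE" not in beautify(no_doctype)
```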