BernardoFire - 9 months ago
Python Question

Stop BeautifulSoup from adding html, head and body tags automatically

When I use BeautifulSoup with html5lib, it adds the html, head and body tags automatically:

BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>

Is there an option I can set to turn off this behavior?

In [35]: import bs4 as bs

In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>

This parses the HTML with Python's built-in HTML parser, html.parser. Quoting the docs:

Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
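The same idea as a standalone script rather than an IPython session. This is only a sketch: the helper name parse_fragment is made up for illustration, and it assumes nothing beyond bs4 being installed, since html.parser ships with Python.

```python
from bs4 import BeautifulSoup


def parse_fragment(markup):
    """Parse an HTML fragment without wrapping it in <html>/<head>/<body>.

    parse_fragment is a hypothetical helper name, not part of bs4.
    """
    # "html.parser" selects Python's built-in parser, which leaves
    # fragments as they are instead of building a full document.
    return BeautifulSoup(markup, "html.parser")


soup = parse_fragment("<h1>FOO</h1>")

# Serialize the parsed fragment back to markup.
fragment_html = str(soup)

# Check that no <html> or <body> wrapper was added.
has_wrapper = soup.find("html") is not None or soup.find("body") is not None

print(fragment_html)
print(has_wrapper)
```

The trade-off is that html.parser does not repair markup the way a browser would, so badly broken input may come out differently than it would with html5lib.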

Alternatively, you could keep the html5lib parser and just select the first element inside <body>:

In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')

In [62]:
Out[62]: <h1>FOO</h1>
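One possible way to get that result, assuming html5lib is installed. The calls used here (next_element and decode_contents()) exist in bs4, but they are not necessarily the ones used in the session above, and body_fragment is a made-up helper name.

```python
from bs4 import BeautifulSoup


def body_fragment(markup):
    """Parse with html5lib, then return only the markup inside <body>.

    body_fragment is a hypothetical helper name, not part of bs4.
    """
    soup = BeautifulSoup(markup, "html5lib")
    # decode_contents() serializes a tag's children, not the tag itself,
    # so the <body> wrapper is left out of the result.
    return soup.body.decode_contents()


soup = BeautifulSoup("<h1>FOO</h1>", "html5lib")

# next_element is the first node parsed after the <body> start tag,
# which here is the <h1> element.
first = soup.body.next_element

print(first)
print(body_fragment("<h1>FOO</h1><p>bar</p>"))
```

next_element returns a single node, so for fragments with several top-level elements decode_contents() is probably the better fit. Note that html5lib moves head-only elements such as <title> into <head>, so they would not appear in the body contents.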