V Y - 1 year ago
HTML Question

Using Python 3, what is the fastest way to extract all <div> blocks from an HTML string?

There are inner <div> blocks inside a <div>.
What is the fastest way to extract all <div> blocks from an HTML string?
(bs4, lxml or regex?)

Answer Source

lxml is generally considered to be the fastest among existing Python parsers, though the parsing speed depends on multiple factors starting with the specific HTML to parse and ending with the computational power you have available. For HTML parsing use the lxml.html subpackage:

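from lxml.html import fromstring, tostring

data = """my HTML string"""  # placeholder: put the HTML to parse here
root = fromstring(data)

# Serialized markup of every <div> below the root (tostring() returns bytes by default)
print([tostring(div) for div in root.xpath(".//div")])
# Text content of every <div>, with the markup stripped
print([div.text_content() for div in root.xpath(".//div")])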

There is also the awesome BeautifulSoup parser which, if allowed to use lxml under the hood, is a great combination of convenience, flexibility and speed. It is generally not faster than pure lxml, but it comes with one of the best APIs I've ever seen, allowing you to "view" your XML/HTML from different angles and use a huge variety of techniques:

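from bs4 import BeautifulSoup

# data is the HTML string from the previous snippet;
# "lxml" selects lxml as the underlying parser (the lxml package must be installed)
soup = BeautifulSoup(data, "lxml")
# Markup of every <div>, nested ones included
print([str(div) for div in soup.find_all("div")])
# Text content of every <div>
print([div.get_text() for div in soup.find_all("div")])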

And, I personally think, there is rarely a case where regex is suitable for HTML parsing.
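To make the regex point concrete for nested blocks, here is a small sketch using only the standard library. The sample string, the pattern and the `DivTextCollector` class are all made up for this illustration; the parser half uses `html.parser` rather than lxml or bs4, purely so that the example has no third-party dependencies.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: one <div> nested inside another.
sample = '<div id="outer">Hello <div id="inner">world</div>!</div>'

# A typical naive pattern: the non-greedy match stops at the FIRST </div>,
# so the outer block is cut short and the inner block is never reported.
naive = re.findall(r"<div\b[^>]*>.*?</div>", sample, flags=re.DOTALL)
print(naive)  # ['<div id="outer">Hello <div id="inner">world</div>']


class DivTextCollector(HTMLParser):
    """Collect the text of every <div>, nested ones included."""

    def __init__(self):
        super().__init__()
        self.texts = []  # one entry per <div>, in document order
        self._open = []  # indexes into self.texts for divs still open

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._open.append(len(self.texts))
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to every <div> that is currently open.
        for index in self._open:
            self.texts[index] += data


collector = DivTextCollector()
collector.feed(sample)
collector.close()
print(collector.texts)  # ['Hello world!', 'world']
```

The regex has no notion of nesting depth, while any real parser tracks open and close tags, which is why the parser-based approaches above handle inner blocks correctly.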
