F. Esposito - 24 days ago
HTML Question

Regex with HTML tags and escaped characters

I have this text:



<h5 class="subblocksubhead subsubsectionhead first"><b>Messaggi inseriti</b></h5>
<dl class="blockrow stats">
<dt><b>Messaggi inseriti</b></dt>
<dd> 81</dd>
</dl>
<dl class="blockrow stats">
<dt>Media dei messaggi giornalieri</dt>
<dd> 0.02</dd>
</dl>


and I'm trying to extract the " 81" using this code:

regex_message_sent_num=r'Messaggi inseriti<.+>\n\t\t<.+?>(\s.+)<.+?>'
pattern_message_sent_num=re.compile(regex_message_sent_num)
results_message_sent_num=re.findall(pattern_message_sent_num,html_text)


I always get an empty list as output, whereas when I test the code here I get the right extraction.

Any idea what I'm doing wrong? The HTML comes from a webpage from which I'm trying to extract some visible data as an exercise. I tested the regex on the HTML text saved from the Chrome browser.

Answer

Use an HTML parser instead, such as BeautifulSoup.

Using the text search and the find_next_sibling() method:

from bs4 import BeautifulSoup

data = """
<div>
    <dl class="blockrow stats">
        <dt><b>Messaggi inseriti</b></dt>
        <dd> 81</dd>
    </dl>
    <dl class="blockrow stats">
        <dt>Media dei messaggi giornalieri</dt>
        <dd> 0.02</dd>
    </dl>
</div>"""

soup = BeautifulSoup(data, "html.parser")

label = soup.find("dt", string="Messaggi inseriti")
print(label.find_next_sibling("dd").get_text(strip=True))

Prints 81.
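
As for why the original pattern returns an empty list: one plausible cause, which the question does not confirm, is the hard-coded `\n\t\t` between the tags. If the downloaded page uses spaces, `\r\n`, or a different number of tabs than the copy pasted into the regex tester, the pattern cannot match. A minimal sketch of that effect, using only the standard library; the space-indented sample text and the names `strict` and `tolerant` are assumptions made for illustration:

```python
import re

# Same markup as in the question, but indented with spaces instead of tabs,
# as a fetched page might be (an assumption for illustration).
html_text = """
<dl class="blockrow stats">
    <dt><b>Messaggi inseriti</b></dt>
    <dd> 81</dd>
</dl>
<dl class="blockrow stats">
    <dt>Media dei messaggi giornalieri</dt>
    <dd> 0.02</dd>
</dl>
"""

# Pattern from the question: requires exactly "\n\t\t" after the closing tags.
strict = re.compile(r'Messaggi inseriti<.+>\n\t\t<.+?>(\s.+)<.+?>')

# Tolerant variant: \s* accepts any run of whitespace between the tags.
tolerant = re.compile(r'Messaggi inseriti</b></dt>\s*<dd>([^<]*)</dd>')

strict_matches = strict.findall(html_text)
tolerant_matches = tolerant.findall(html_text)

print(strict_matches)    # empty list: the whitespace does not match
print(tolerant_matches)  # one match, the text inside <dd>

# To see the real whitespace in your own data, inspect the raw characters:
idx = html_text.find("Messaggi inseriti")
print(repr(html_text[idx:idx + 50]))
```

Printing the `repr()` of the text around the label shows which whitespace characters are really there. Even with the tolerant pattern, the parser-based approach above is less likely to break when the markup changes.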