git-e - 1 month ago
Python Question

Get text in nested HTML tags with regex in Python

I have some text with HTML tags:

<p><b>Name and LastName</b><br />
Work Title<br /><span class="text-spacer"></span>
</p>


I want to get the text in this format:

Name and LastName - Work Title


This is my code in Python, but it doesn't work:

import re

text = '''<p><b>Name and LastName</b><br />
Work Title<br /><span class="text-spacer"></span>
</p>'''
my_text = re.sub(r'</b><br />', ' - ', text)

Answer

I'd use a specialized tool for the job - an HTML Parser, like BeautifulSoup:

In [1]: from bs4 import BeautifulSoup

In [2]: data = """<p><b>Name and LastName</b><br />
    ...: Work Title<br /><span class="text-spacer"></span>
    ...: </p>"""

In [3]: soup = BeautifulSoup(data, "html.parser")

In [4]: soup.p.get_text(separator=" - ", strip=True)
Out[4]: u'Name and LastName - Work Title'

Note the use of the separator argument: it lets you provide a custom separator that is placed between the child nodes' text when getting the text of the parent, which fits your use case nicely.
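If adding a third-party dependency is not an option, roughly the same idea can be sketched with the standard library's html.parser module: collect each non-empty text node and join them with the separator. This is only a minimal sketch, not how BeautifulSoup works internally. The names TextCollector and text_with_separator are made up for illustration, and the whitespace handling assumes a small fragment like the one in the question.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the non-empty text nodes of a fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def text_with_separator(html, separator=" - "):
    # Feed the whole fragment, then join the collected text nodes.
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


data = """<p><b>Name and LastName</b><br />
Work Title<br /><span class="text-spacer"></span>
</p>"""

result = text_with_separator(data)
print(result)
```

For anything beyond small, well-formed fragments, a full parser such as BeautifulSoup is likely the more robust choice.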
