Hyperion - 7 days ago
Python Question

Python / BeautifulSoup - Extract div content by checking the h1 text

I have an html page like this:

<div class="class1">
<div class="head">
<h1 class="title">Title 1</h1>
<div class="body">
<!-- some body content -->
</div>
</div>
</div>

<div class="class1">
<div class="head">
<h1 class="title">Title 2</h1>
<div class="body">
<!-- some body content -->
</div>
</div>
</div>


I need to extract the content of the div with class body only if the title is equal to "Title 2". Since the parent containers don't have specific ids or classes, the h1 text is the only way to tell what each div is about. At the moment I use this code:

from bs4 import BeautifulSoup

# code to open the webpage
soup = BeautifulSoup(data, 'lxml')
body_content = soup.findAll('div', {'class':'class1'})[1]


But this isn't very elegant: it assumes that the div I'm interested in is always the second one on the page, and it doesn't check the title.

Answer

The only solution I can think of is the following:

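from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# Collect every container div, then keep the first one whose markup contains the title.
result_tags = soup.find_all(name='div', class_='class1')
body_content = [tag for tag in result_tags if 'Title 2' in tag.prettify()][0]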

It is better than your original code, since it does not assume that the target div is the second one on the page.
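
The substring check above matches "Title 2" anywhere in the container's markup and returns the whole class1 div rather than the body. A possible refinement, offered only as a sketch, is to compare the h1 text exactly and then take the sibling div with class body. The helper name body_for_title and the sample body text are made up for illustration, and the markup is assumed to have the same nesting as in the question.

```python
from bs4 import BeautifulSoup

# Sample markup with the same structure as in the question;
# the body text is placeholder content for illustration.
html = """
<div class="class1">
  <div class="head">
    <h1 class="title">Title 1</h1>
    <div class="body">first body</div>
  </div>
</div>
<div class="class1">
  <div class="head">
    <h1 class="title">Title 2</h1>
    <div class="body">second body</div>
  </div>
</div>
"""


def body_for_title(soup, title):
    """Return the div.body that follows the h1.title with exactly this text, or None."""
    for h1 in soup.find_all("h1", class_="title"):
        # Compare the heading text exactly, ignoring surrounding whitespace.
        if h1.get_text(strip=True) == title:
            # In this markup the body div is a later sibling of the h1.
            return h1.find_next_sibling("div", class_="body")
    return None


soup = BeautifulSoup(html, "html.parser")
body = body_for_title(soup, "Title 2")
if body is not None:
    print(body.get_text(strip=True))
```

Returning None when no heading matches avoids the IndexError that indexing an empty list with [0] would raise if the title were missing from the page.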