Frank Frank - 5 months ago 13
Python Question

Get Certain Tags Within Parent Tag Using Beautifulsoup4

I am using beautifulsoup4 with Python to scrape content from the web. I am trying to extract content from specific HTML tags while ignoring others.

I have the following html:

<div class="the-one-i-want">
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<div class="random-inserted-element-i-dont-want">
<content>
</div>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
</div>


My goal is to understand how to instruct Python to get only the <p> elements from within the parent <div class="the-one-i-want">, ignoring all the <div>s within.

Currently, I am locating the content of the parent div by the following method:

content = soup.find('div', class_='the-one-i-want')


However, I can't figure out how to narrow that down to extract only the <p> tags without getting an error.

Answer
h = """<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">
        <content>
    </div>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
</div>"""

You can just call find_all("p") on the div once you have found it:

from bs4 import BeautifulSoup
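# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(h, "html.parser")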

print(soup.find("div","the-one-i-want").find_all("p"))

Or use a CSS selector:

print(soup.select("div.the-one-i-want p"))

Both will give you:

[<p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>]

find_all searches only the descendants of the div with the class the-one-i-want, and the same applies to the select call.
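
One caveat worth spelling out: "descendants" includes <p> tags nested inside the inner div, if there are any. The sample HTML in the question has none, so both approaches above return exactly the five paragraphs. The sketch below uses hypothetical markup (not from the question) in which the unwanted div does contain a <p>, and shows two ways to keep only the direct children: recursive=False on find_all, and the > child combinator in select.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: one <p> sits inside the unwanted inner div.
html = """<div class="the-one-i-want">
    <p>first</p>
    <div class="random-inserted-element-i-dont-want">
        <p>nested</p>
    </div>
    <p>second</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# By default find_all searches every descendant, so the nested <p> is included.
all_paragraphs = parent.find_all("p")

# recursive=False restricts the search to direct children of the parent div.
direct_paragraphs = parent.find_all("p", recursive=False)

# The CSS child combinator (>) expresses the same restriction.
selected_paragraphs = soup.select("div.the-one-i-want > p")

print([p.get_text(strip=True) for p in all_paragraphs])       # ['first', 'nested', 'second']
print([p.get_text(strip=True) for p in direct_paragraphs])    # ['first', 'second']
print([p.get_text(strip=True) for p in selected_paragraphs])  # ['first', 'second']
```

If the inner divs never contain paragraphs, the plain find_all("p") or "div.the-one-i-want p" form is enough, and the stricter variants are only needed when nested <p> tags must be excluded.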