joycey - 3 months ago
Python Question

Extracting text within tag with BeautifulSoup

<div>
<p class="tabbed" style="margin-top:2px;"><span class="tab"><strong>LANGUAGES</strong></span>Cantonese</p>
<p class="tabbed" style="margin-top:2px;"><span class="tab"></span>English</p>
<p class="tabbed" style="margin-top:2px;"><span class="tab"></span>Putonghua</p>
<p class="tabbed"><span class="tab"><strong>GENDER</strong></span>Male</p>
</div>


I would like to extract the text "Male" from the 5th line (the GENDER paragraph), but I don't know how to do it. Can anyone help?
I tried "gen = soup.find('span', class_='tab').string" but it doesn't work.

Answer

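You can use the .find() method with a function as the filter. It returns the first tag whose text starts with "GENDER" (here the enclosing <p>), and the slice drops the six-character label: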

In [37]: from bs4 import BeautifulSoup

In [38]: soup = BeautifulSoup("""<div>
     ...: <p class="tabbed" style="margin-top:2px;"><span class="tab"><strong>LANGUAGES</strong></span>Cantonese</p>
     ...: <p class="tabbed" style="margin-top:2px;"><span class="tab"></span>English</p>
     ...: <p class="tabbed" style="margin-top:2px;"><span class="tab"></span>Putonghua</p>
     ...: <p class="tabbed"><span class="tab"><strong>GENDER</strong></span>Male</p>
     ...:    </div>""", "html")

In [39]: soup.find(lambda tag: tag.text.startswith('GENDER')).text[6:]
Out[39]: u'Male'
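
The slice [6:] only works while the label is exactly six characters long. A possible alternative, offered as a sketch rather than as part of the original answer, is to locate the <strong> label itself and read the text node that follows its parent <span>. The names html, label and gender and the choice of the "html.parser" backend are assumptions made for illustration. The string= argument needs Beautiful Soup 4.4.0 or later (earlier releases call it text=).

```python
from bs4 import BeautifulSoup

# Same markup as in the question.
html = """<div>
<p class="tabbed" style="margin-top:2px;"><span class="tab"><strong>LANGUAGES</strong></span>Cantonese</p>
<p class="tabbed" style="margin-top:2px;"><span class="tab"></span>English</p>
<p class="tabbed" style="margin-top:2px;"><span class="tab"></span>Putonghua</p>
<p class="tabbed"><span class="tab"><strong>GENDER</strong></span>Male</p>
</div>"""

# "html.parser" is the parser bundled with Python (assumed here for portability).
soup = BeautifulSoup(html, "html.parser")

# Find the <strong> tag whose only string is exactly "GENDER".
label = soup.find("strong", string="GENDER")

# label.parent is the enclosing <span class="tab">; the node right after it
# inside the <p> is the text we want.
gender = label.parent.next_sibling.strip()

print(gender)
```

The attempt in the question does not work because find() returns only the first matching <span class="tab">, which is the LANGUAGES one, so its .string is "LANGUAGES" rather than anything from the GENDER paragraph.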