MichaelMaggs - 7 months ago
Python Question

Beautiful Soup: extracting tagged and untagged HTML text

As a novice with bs4 I'm looking for some help in working out how to extract the text from a series of webpage tables, one of which is like this:

<table style="padding:0px; margin:1px" width="715px">
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Name: </strong></span>
Tyto alba
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Order: </strong></span>
<td height="22" width="33%">
<span class="darkGreenText"><strong> Family: </strong></span>
<td height="22" width="66%" colspan="2">
<span class="darkGreenText"><strong> Status: </strong></span>
Least Concern

Desired output:

Name: Tyto alba
Order: Strigiformes
Family: Tytonidae
Status: Least Concern

I've tried two approaches recommended elsewhere, but I'm getting stuck because one part of the text I need is tagged and the second part is not. Any help would be appreciated.


It seems like what you want is to call get_text(strip=True) on each BeautifulSoup Tag. Assuming raw_html is the HTML you pasted above:

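from bs4 import BeautifulSoup

htmlSoup = BeautifulSoup(raw_html)
for tag in htmlSoup.find_all('td'):
    # strip=True trims the whitespace around each piece of text in the cell
    print(tag.get_text(strip=True))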

which prints:

Name:Tyto alba
Order:Strigiformes
Family:Tytonidae
Status:Least Concern
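
If you want the space after each label, as in your desired output, one option is to pass a separator string as the first argument to get_text. Below is a minimal sketch rather than a definitive solution. It assumes a tidied-up copy of your table: the closing tags and the <a> wrapper around the order are my guesses, and extract_fields is just a hypothetical helper name. It also names the parser explicitly, which is safe here because the sample is well formed.

```python
from bs4 import BeautifulSoup

# Assumed, well-formed stand-in for the table in the question.
# The closing tags and the <a> element are illustrative guesses.
raw_html = """
<table>
<tr>
<td><span class="darkGreenText"><strong> Name: </strong></span> Tyto alba </td>
<td><span class="darkGreenText"><strong> Order: </strong></span> <a href="#">Strigiformes</a> </td>
</tr>
<tr>
<td><span class="darkGreenText"><strong> Family: </strong></span> Tytonidae </td>
<td colspan="2"><span class="darkGreenText"><strong> Status: </strong></span> Least Concern </td>
</tr>
</table>
"""


def extract_fields(html):
    """Return one 'Label: value' string per table cell (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # The first argument is the separator placed between text fragments;
    # strip=True trims each fragment and drops whitespace-only ones.
    return [td.get_text(" ", strip=True) for td in soup.find_all("td")]


lines = extract_fields(raw_html)
for line in lines:
    print(line)

# Optionally turn the lines into a dict keyed by label.
fields = {}
for line in lines:
    label, _, value = line.partition(":")
    fields[label.strip()] = value.strip()
```

Because get_text walks every string inside the cell, it picks up the text inside the span, the bare text after it, and anything inside nested tags in one call, so you don't need to treat the tagged and untagged parts separately.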