sparkandshine - 5 months ago
HTML Question

How can I remove all different script tags in BeautifulSoup?

I scraped a table from a web page and would like to rebuild the table with all the markup tags removed. Here is my code.

response = requests.get(url)
soup = BeautifulSoup(response.text)
table = soup.find('table')

for row in table.find_all('tr'):
    for col in row.find_all('td'):
        # remove all different script tags
        # col.replace_with('')
        # col.decompose()
        # col.extract()
        col = col.contents


How can I remove all of these tags? Take the following cell as an example, which includes the tags a, br and td.

<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>


My expected result is:



Signal et Communication
Ingénierie Réseaux et Télécommunications

Answer

You are asking about get_text():

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string

td = soup.find("td")
td.get_text()
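
As a minimal sketch of how this applies to the cell from the question: get_text() also accepts separator and strip arguments, which help when the raw text carries stray newlines. The variable names raw_text and clean_text are my own, and I am assuming the built-in "html.parser" here.

```python
from bs4 import BeautifulSoup

# The cell from the question, inlined so the snippet is self-contained.
html = (
    '<td><a href="http://www.irit.fr/SC">Signal et Communication</a>\n'
    '<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>\n'
    '</td>'
)
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# Plain get_text() keeps the whitespace found between the tags.
raw_text = td.get_text()

# strip=True trims each text fragment and drops whitespace-only ones;
# separator is placed between the remaining fragments.
clean_text = td.get_text(separator="\n", strip=True)
print(clean_text)
```

Whether you want the stripped variant depends on how the text is used afterwards; plain get_text() is enough if trailing whitespace does not matter.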

Note that .string would return None in this case, since td has multiple children:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None
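
To illustrate the difference, here is a small sketch with two made-up cells (the "one"/"two" contents are placeholders, not from the question): one cell wraps a single link, the other wraps several children. When .string is None, .stripped_strings is one way to still get at the individual text fragments.

```python
from bs4 import BeautifulSoup

# Two hypothetical cells: the first has one child, the second has several.
soup = BeautifulSoup(
    '<table><tr><td><a href="#">one</a></td>'
    '<td><a href="#">one</a><br/><a href="#">two</a></td></tr></table>',
    "html.parser",
)
single, multiple = soup.find_all("td")

# With exactly one child, .string resolves to that child's string.
single_string = single.string

# With several children, .string is None.
multiple_string = multiple.string

# .stripped_strings yields every text fragment, trimmed of whitespace.
parts = list(multiple.stripped_strings)
print(single_string, multiple_string, parts)
```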

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> soup = BeautifulSoup(u"""
... <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
... <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
... </td>
... """, "html.parser")
>>> 
>>> td = soup.td
>>> print(td.string)
None
>>> print(td.get_text())
Signal et Communication
Ingénierie Réseaux et Télécommunications
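
Applied to the loop from the question, the idea could look roughly like the sketch below. The table is inlined instead of fetched with requests, and the header row and the "IRIT" cell are invented for illustration; only the second cell comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page fetched with requests.get(url).
html = """
<table>
  <tr><td>Team</td><td>Groups</td></tr>
  <tr>
    <td>IRIT</td>
    <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
    <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
    </td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Rebuild the table as a list of rows, each row a list of plain-text cells.
rows = [
    [col.get_text(separator="\n", strip=True) for col in row.find_all("td")]
    for row in table.find_all("tr")
]

for row in rows:
    print(row)
```

Building a new list avoids reassigning col inside the loop, which only rebinds the loop variable and leaves the parsed tree untouched.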