goose goose - 4 months ago 15
Python Question

BeautifulSoup won't remove i element

I am learning how to parse and manipulate html using beautiful soup like so:

from lxml.html import parse
import urllib2
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

url = 'some-url-here'
req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
parsed = urllib2.urlopen( req )
soup = BeautifulSoup(parsed)

for elem in soup.findAll(['script', 'style', 'i']):
    elem.extract()

for main_body in soup.findAll("div", {"role" : "main"}):
    print main_body.getText(separator=u' ')


The result still contains <i> tags, and I can't figure out how to remove them. How can this be accomplished, and why is <i> the only tag not removed by the above code?

Answer

The issue is that you are using the deprecated BeautifulSoup 3. Install bs4 (BeautifulSoup 4) and import from it instead, and everything will work fine:

In [10]: import urllib2
In [11]: from bs4 import BeautifulSoup # bs4

In [12]: url = 'https://www.gwr.com/'

In [13]: req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})

In [14]: parsed = urllib2.urlopen(req)

In [15]: soup = BeautifulSoup(parsed,"html.parser")

In [16]: tags = soup.find_all(['script','style','i'])

In [17]: print(len(tags))
25

In [18]: for elem in tags:
   ....:         elem.extract()
   ....:     

In [19]: assert len(soup.find_all(['script','style','i'])) == 0

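If you want to check the behaviour without hitting a live site, here is a minimal offline sketch of the same approach. The HTML snippet and the name main_text are made up for illustration; it assumes bs4 is installed and uses the built-in "html.parser" backend, as in the session above.

```python
from bs4 import BeautifulSoup  # BeautifulSoup 4, not the old BeautifulSoup 3 package

# Hypothetical page content, standing in for the fetched response.
html = """
<html><head><style>p {color: red}</style><script>var x = 1;</script></head>
<body><div role="main"><p>Hello <i>hidden</i> world</p></div></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Detach every <script>, <style> and <i> element from the tree.
for elem in soup.find_all(["script", "style", "i"]):
    elem.extract()

# Collect the remaining text of each div with role="main".
main_text = [
    div.get_text(separator=" ", strip=True)
    for div in soup.find_all("div", {"role": "main"})
]
print(main_text)
```

Note that bs4 uses find_all and get_text where BeautifulSoup 3 used findAll and getText; the old spellings still work in bs4 as aliases, but the new ones are preferred.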