abcla abcla - 1 year ago 90
Python Question

Python: remove all HTML tags from a string

I am trying to access the article content from a website, using BeautifulSoup with the code below:

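import urllib2
from bs4 import BeautifulSoup

site = ''  # URL left blank in the original post
req = urllib2.Request(site)  # req was not defined in the original snippet
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
content = soup.find_all('p')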

The content object contains all of the main text from the page that is within 'p' tags. However, other tags are still present in the output, as can be seen in the image below. I would like to remove the tags themselves, along with all characters enclosed in matching < > pairs, so that only the text remains.

I have tried the following method, but it does not seem to work.

' '.join(item for item in content.split() if not (item.startswith('<') and item.endswith('>')))

What is the best way to remove substrings from a string that begin and end with a certain pattern, such as < >?

[Image: output still containing tags]


You could call get_text() on each tag:

for i in content:
    print i.get_text()
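
As for why the attempt in the question does not work: content is a ResultSet (a list of tags), not a string, so it has no split() method. Even on a plain string, splitting on whitespace does not isolate the tags. A small sketch, using a made-up sample string and a hypothetical helper name, shows what happens:

```python
def drop_tag_tokens(text):
    # The approach from the question: split on whitespace and drop
    # tokens that both start with '<' and end with '>'.
    return ' '.join(item for item in text.split()
                    if not (item.startswith('<') and item.endswith('>')))


sample = '<p>Hello <b>world</b></p>'  # hypothetical input, not from the question
result = drop_tag_tokens(sample)
print(result)  # <p>Hello
```

Only the token '<b>world</b></p>' both starts with < and ends with >, so it is dropped together with its text, while '<p>Hello' is kept, tag and all.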

The example below is from the docs:

>>> markup = '<a href="">\nI linked to <i></i>\n</a>'
>>> soup = BeautifulSoup(markup)
>>> soup.get_text()
u'\nI linked to \n'
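
If you only have a plain string and would rather not go through BeautifulSoup, the standard library's HTMLParser can do something similar. This is only a rough sketch under a few assumptions: TextExtractor and strip_tags are names I made up, the sample string is hypothetical, and the sketch keeps any text found inside <script> and <style> elements, so get_text() is probably still the simpler choice for your case.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class TextExtractor(HTMLParser):
    """Collects only the text seen between tags (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped.
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


# Hypothetical sample input
text = strip_tags('<p>Hello <b>world</b>, see <a href="#">this</a>.</p>')
print(text)  # Hello world, see this.
```

Because the parser tokenizes the markup instead of matching on < and >, attributes and nested tags are handled without any pattern matching on the string itself.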