I am trying to extract the article content from a website using BeautifulSoup, with the code below:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
content = soup.find_all('p')
The content object contains all of the main text from the page that is within 'p' tags. However, other tags are still present within the output, as can be seen in the image below. I would like to remove the tags themselves, along with all characters enclosed in matching pairs of < >, so that only the text remains.
I have tried the following method, but it does not seem to work.
' '.join(item for item in content.split() if not (item.startswith('<') and item.endswith('>')))
What is the best way to remove substrings in a string that begin and end with a certain pattern, such as < and >?
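To make the goal concrete, here is a minimal sketch of the output I am after, using a regular expression on a made-up snippet. The sample markup and the helper name strip_tags are hypothetical and only for illustration. I am not sure a regex like this is the right approach, since it is probably fragile on real HTML (for example, a literal '>' inside an attribute value would break it).

```python
import re

# Hypothetical sample standing in for the string form of one element of `content`.
sample = '<p>Some <b>bold</b> text with a <a href="/x">link</a>.</p>'

# Match '<', then any run of characters other than '>', then the closing '>'.
TAG_PATTERN = re.compile(r'<[^>]*>')

def strip_tags(markup):
    """Return markup with every <...> span removed (illustrative helper)."""
    return TAG_PATTERN.sub('', markup)

print(strip_tags(sample))
```

For the sample above this prints "Some bold text with a link.", which is the kind of result I want for every element in content.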