av abhishiek - 12 days ago
CSS Question

Extract only text from div tags on a page using Python and Beautiful Soup

I am trying to scrape a static news website as a project using Beautiful Soup, but I am stuck on a page where the text (that is, the news article) is contained in a div tag.

The link to the page is:
http://economictimes.indiatimes.com/magazines/panache/smoking-aces-chef-irshad-qureshis-interesting-stories-related-to-celebrities/articleshow/48712333.cms

The news text is contained in the format below:

<html>
<body>
<div class="normal" id="foo">
" Many "
<a href ='/some link' target = 'blank'>Bollywood</a>
" stars today are avowed foodies "
<a href = 'link2'>Ranbir Kapoor</a>
" Alia Bhat "
</div>
</body>
</html>


The text I want is "Many Bollywood stars today are avowed foodies. Alia Bhat"

That is, I want all the text, wherever it appears inside the div.

I was able to reach the div using find_all('div','normal'), but I am stuck on how to retrieve all the text elements from it after that.

Please let me know if you want any more info.

Answer

To extract the text from an element in Beautiful Soup, you can use the .text attribute, which returns the text of the element together with the text of all its descendants:

>>> from bs4 import BeautifulSoup
>>> t = """<div class="normal" id="foo">  Many  <a href ='/some link' target = 'blank'>Bollywood</a>  stars today  are avowed foodies  <a href = 'link2'>Ranbir Kapoor</a>  Alia Bhat  </div>"""
>>> bs = BeautifulSoup(t, "html.parser")  # name the parser explicitly to avoid a warning
>>> print(bs.find('div').text)
  Many  Bollywood  stars today  are avowed foodies  Ranbir Kapoor  Alia Bhat
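
The output above keeps the original spacing between the text fragments. If you want a single clean string per div, one possible approach is to combine find_all with get_text and then collapse the whitespace. The following is only a sketch: the helper name extract_div_text and the shortened sample markup are made up for illustration, and it assumes the article text lives in div elements with the class "normal", as in the question.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the structure shown in the question.
html = """<html><body>
<div class="normal" id="foo">
 Many
<a href='/some link' target='blank'>Bollywood</a>
 stars today are avowed foodies
<a href='link2'>Ranbir Kapoor</a>
 Alia Bhat
</div>
</body></html>"""


def extract_div_text(markup, css_class="normal"):
    """Return the whitespace-normalised text of every div with the given class.

    extract_div_text is a hypothetical helper, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    texts = []
    for div in soup.find_all("div", class_=css_class):
        # get_text(" ") joins the div's own strings and those of its
        # child tags with a space; split()/join() then collapses runs
        # of spaces and newlines into single spaces.
        texts.append(" ".join(div.get_text(" ").split()))
    return texts


article_texts = extract_div_text(html)
print(article_texts)
```

Note that this returns the text of every nested tag, so link text such as "Ranbir Kapoor" is included along with the loose strings in the div.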