dlpnewbie96 - 3 months ago
Python Question

How do I get the titles (link text) of the links in an HTML page using BeautifulSoup in Python?

Consider the HTML:

<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>
<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Bus_Route_Chart_Aug16.pdf" style="font-size:14px" target="_blank">UPDATED BUS ROUTE </a></li>
<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Faculty_Requirement_Aug2016.jpg" target="_blank">Application are invited </a></li>


I want to extract the following titles and save them in a list:

Student Notice
UPDATED BUS ROUTE
Application are invited


How can I do it using urllib2 and BeautifulSoup?

Answer

You don't need urllib if you already have the HTML. urllib is used to make a request to a web server, which then returns the HTML. When you already have the HTML, you can simply do this:

>>> from bs4 import BeautifulSoup
>>> a = """<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>
... <li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Bus_Route_Chart_Aug16.pdf" style="font-size:14px" target="_blank">UPDATED BUS ROUTE </a></li>
... <li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Faculty_Requirement_Aug2016.jpg" target="_blank">Application are invited </a></li>"""
>>> b = BeautifulSoup(a, 'html.parser')
>>> c = b.find_all('li')

>>> for elem in c:
...     print(elem.a.string)
... 
 Student Notice 
UPDATED BUS ROUTE 
Application are invited
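
The loop above only prints the link text, and the first title still carries its surrounding spaces. Since the question asks for a list, here is a minimal sketch of one way to collect the titles with the whitespace stripped. The helper name extract_link_titles is made up for illustration, and the sample HTML is a trimmed copy of the question's markup (the img tags are dropped to keep the lines short).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (img tags omitted for brevity).
html = """<li><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>
<li><a href="pdffiles/Sangam_University_Bus_Route_Chart_Aug16.pdf" style="font-size:14px" target="_blank">UPDATED BUS ROUTE </a></li>
<li><a href="pdffiles/Sangam_University_Faculty_Requirement_Aug2016.jpg" target="_blank">Application are invited </a></li>"""


def extract_link_titles(html):
    """Return the stripped text of the first <a> inside each <li> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # li.a is the first <a> descendant of the <li>; skip items without a link.
    return [li.a.get_text(strip=True) for li in soup.find_all("li") if li.a]


titles = extract_link_titles(html)
print(titles)
```

If the page has to be fetched first, the HTML string could come from urllib2.urlopen(url).read() in Python 2, or urllib.request.urlopen(url).read() in Python 3, where url stands for whatever page you are scraping.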