nakiya nakiya - 4 months ago 35
HTML Question

How to get an HTML file using Python?

I am not very familiar with Python. I am trying to extract the artist names (for a start :)) from the following page:

How do I retrieve the page? My two main concerns are: which functions to use, and how to filter out useless links from the page?


Example using urllib and lxml.html:

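from pprint import pprint
from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen
from lxml import html

url = ""  # set this to the address of the page
page = html.fromstring(urlopen(url).read())

# Collect a (link text, href) pair for every <a> element on the page
links = [(link.text, link.get("href")) for link in page.xpath("//a")]
pprint(links)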

Output (truncated):
    [('Aathma Liyanage', 'athma.html'),
     ('Abewardhana Balasuriya', 'abewardhana.html'),
     ('Aelian Thilakeratne', 'aelian_thi.html'),
     ('Ahamed Mohideen', 'ahamed.html'),
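
To filter out the useless links, one option is to post-process the (name, href) pairs. The following is only a sketch: the function name filter_artist_links is made up for illustration, and the rule it applies (keep links that have visible text and point to a relative .html file) is an assumption about how the artist pages are linked, so adjust it after looking at the real page. The last three sample entries are hypothetical examples of links you would want to drop.

```python
from urllib.parse import urlparse


def filter_artist_links(links):
    """Keep (name, href) pairs that look like links to artist pages."""
    kept = []
    for name, href in links:
        # Skip anchors with no visible text or no href (image links, named anchors).
        if not name or not name.strip() or not href:
            continue
        parsed = urlparse(href)
        # Skip absolute URLs (external sites, mailto:) and same-page fragments.
        if parsed.scheme or parsed.netloc or not parsed.path:
            continue
        # Assumption: artist pages are relative .html files.
        if not parsed.path.lower().endswith(".html"):
            continue
        kept.append((name.strip(), href))
    return kept


sample = [
    ("Aathma Liyanage", "athma.html"),
    ("Abewardhana Balasuriya", "abewardhana.html"),
    (None, "logo.html"),                        # hypothetical: link with no text
    ("Home", "http://example.com/index.html"),  # hypothetical: absolute link
    ("Top", "#top"),                            # hypothetical: same-page anchor
]
print(filter_artist_links(sample))
```

With the list built by the lxml example, you would call filter_artist_links(links) instead of using the sample data. If the artist links all sit inside one table or div, a narrower XPath expression than "//a" may do the same job more directly.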