user371012 user371012 - 4 months ago 13
HTML Question

how can I get href links from html code

import urllib2

website = "WEBSITE"

openwebsite = urllib2.urlopen(website)

html = getwebsite.read()

print html


so far so good.

But i want only href links from the plain text html.

How can i solve this problem?

Answer

Try with Beautifulsoup:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

In case you just want links starting with http://, you should use:

soup.findAll('a', attrs={'href': re.compile("^http://")})