init init - 1 year ago 60
HTML Question

python beautifulsoup no link when parsing 'a' tag and href

Apologies if there is a duplicate, I searched but couldn't find an answer.
I am writing a scraper for the default directory index page served by my web server. The HTML looks like this:

<head><title>Index of /Mysongs</title></head>
<body bgcolor="white">
<h1>Index of /Mysongs</h1><hr><pre><a href="../">../</a>
<a href="Mysong1.mkv">Mysong1.mp3</a> 10-May-2016 07:24 183019
<a href="Mysong2.mkv">Mysong2.ogg</a> 10-May-2016 07:27 177205

The link looks like plain text rather than a URL (<a href="Mysong2.mkv">), but when I point at the text, the browser's status bar shows the full link.

I tried to extract the URL using BeautifulSoup, like this:


import httplib2
import sys
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request(sys.argv[1])
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    print link.get('href')

and I am not able to get the full link, but only
<a href="Mysong1.mkv">Mysong1.mp3</a> 10-May-2016 07:24

Should I be prepending the URL passed on the command line to construct the full link, like

print sys.argv[1] + link.get('href')

or is there some better way to get this?

Edit: Current output is


Expected output:

Answer Source

Yes, your only option is to add the base URL. But don't add it this way:

print sys.argv[1] + link.get('href')

Use this:

from urlparse import urljoin
urljoin('', '../../music/MySong.mp3')

With plain string concatenation, relative paths such as ../ are not identified and resolved; urljoin handles them.
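To make the difference concrete, here is a minimal sketch comparing the two approaches. The base URL `http://example.com/Mysongs/` is a made-up stand-in for whatever is passed as `sys.argv[1]`, and `resolve_links` is a hypothetical helper, not part of any library. The import falls back to `urlparse` so the snippet should also work on Python 2, which the question appears to use.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as in the answer above

# Hypothetical base URL standing in for sys.argv[1] (an assumption).
BASE_URL = "http://example.com/Mysongs/"


def resolve_links(base_url, hrefs):
    """Resolve each (possibly relative) href against base_url."""
    return [urljoin(base_url, href) for href in hrefs]


# The href values from the directory index in the question.
hrefs = ["../", "Mysong1.mkv", "Mysong2.mkv"]

resolved = resolve_links(BASE_URL, hrefs)

# Plain concatenation leaves "../" unresolved in the result.
concatenated = [BASE_URL + href for href in hrefs]

for url in resolved:
    print(url)
```

Note that the trailing slash on the base URL matters: without it, urljoin treats the last path segment as a file name and replaces it, so `urljoin("http://example.com/Mysongs", "Mysong1.mkv")` gives `http://example.com/Mysong1.mkv`.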
