init init - 6 months ago 12
HTML Question

Python BeautifulSoup: no full link when parsing 'a' tag and href

Apologies if there is a duplicate, I searched but couldn't find an answer.
I was writing a scraper for a default directory index page served by my web server. The HTML looks like this:

<html>
<head><title>Index of /Mysongs</title></head>
<body bgcolor="white">
<h1>Index of /Mysongs</h1><hr><pre><a href="../">../</a>
<a href="Mysong1.mkv">Mysong1.mp3</a> 10-May-2016 07:24 183019
<a href="Mysong2.mkv">Mysong2.ogg</a> 10-May-2016 07:27 177205


The href link looks like plain text, not a URL (<a href="Mysong2.mkv">), but when I point at the text, the browser's status bar shows the full link (http://127.0.0.1/Mysongs/Mysong2.ogg).

I tried to extract the URL using BeautifulSoup, like this:

#!/usr/bin/python

import httplib2
import sys
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request(sys.argv[1])
# Parse only the <a> tags and print each href exactly as written in the page
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    print link.get('href')


and I am not able to get a link like http://127.0.0.1/Mysongs/Mysong2.ogg, but only
<a href="Mysong1.mkv">Mysong1.mp3</a> 10-May-2016 07:24


Should I be using sys.argv[1] to construct the href link, like

print sys.argv[1] + link.get('href')


or is there some better way to get this?

Edit: Current output is

Mysong1.mp3
Mysong2.ogg


Expected output:

http://127.0.0.1/Mysongs/Mysong1.mp3
http://127.0.0.1/Mysongs/Mysong2.ogg

Answer

Yes, your only option is to combine the href with the base URL. But don't do it this way:

print sys.argv[1] + link.get('href')

Use this:

from urlparse import urljoin
urljoin('http://something.com/random/abc.html', '../../music/MySong.mp3')

With plain string concatenation, relative paths (such as ../) are not recognized or resolved; urljoin resolves them against the base URL.
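
To make the difference concrete, here is a minimal sketch, assuming the base URL is the directory index from the question (http://127.0.0.1/Mysongs/) and the hrefs are the ones in the posted HTML. The helper name absolute_links is hypothetical, introduced only for illustration. The import fallback is there because urljoin lives in urlparse on Python 2 and in urllib.parse on Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as used in the answer


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url (hypothetical helper)."""
    return [urljoin(base_url, href) for href in hrefs]


# Assumed inputs, taken from the question's directory listing.
base = 'http://127.0.0.1/Mysongs/'
hrefs = ['../', 'Mysong1.mkv', 'Mysong2.mkv']

for url in absolute_links(base, hrefs):
    print(url)

# Plain concatenation leaves the '../' segment unresolved:
print(base + '../')

# Without a trailing slash, urljoin treats 'Mysongs' as a file name
# and replaces it, so the base URL should end with '/':
print(urljoin('http://127.0.0.1/Mysongs', 'Mysong1.mkv'))
```

In the scraper from the question, this would roughly correspond to printing urljoin(sys.argv[1], link.get('href')) inside the loop, provided the URL passed on the command line ends with a slash.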