Michael Michael - 1 year ago 195
Python Question

Python urljoin not removing superfluous dots

I'm using urljoin to get the absolute URLs of the links on a page. For the most part it does a good job of resolving relative links, but I notice that in some cases it does not remove superfluous dots. For example:

>>> urljoin("http://x.com","http://x.com/../../X",False)
'http://x.com/../../X'
>>> urljoin("http://x.com","http://x.com/./../X",False)
'http://x.com/./../X'

If I give such a URL to a web browser, it corrects it fine, but if I try to open it with Python's urlopen() it throws an exception (urllib2.HTTPError: HTTP Error 400: Bad Request).

Is this expected behavior? Is there some other Python function that correctly removes these dots that I should be using instead, or is this a bug?

Answer Source

I think you should use an absolute base and a relative url.
If you call it like this, it removes the dots:

# result: 'http://x.com/index.html'

# result: 'http://x.com/a/b/index.html'
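
As a rough illustration of the "absolute base, relative URL" idea, here is a minimal sketch. The base page and the relative links are made-up inputs chosen for this example, not necessarily the ones used for the results above, and the import assumes Python 3 (in Python 2 the function lives in the urlparse module).

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# Hypothetical page the links were found on (an assumption for this sketch).
base = "http://x.com/a/b/c/d.html"

# Relative links are resolved against the directory of the base page,
# and the ".." segments are collapsed along the way.
one_up = urljoin(base, "../index.html")
three_up = urljoin(base, "../../../index.html")

print(one_up)    # http://x.com/a/b/index.html
print(three_up)  # http://x.com/index.html
```

The point is that the dots are only resolved when the second argument is relative. An absolute second argument with its own host is returned as-is.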

I found a way to normalize a url in this answer. Example:

urljoin('http://www.example.com/foo/bar/../../baz/bux/', '.')
# result: 'http://www.example.com/baz/bux/'
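
One thing to watch with this trick: "." resolves to the directory of the base URL, so it only round-trips URLs that end in a slash. A small sketch, assuming Python 3 and a hypothetical helper name normalize_dir_url that is not part of any library:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def normalize_dir_url(url):
    """Hypothetical helper: collapse dot segments by joining the URL with '.'."""
    return urljoin(url, ".")


# A directory URL (trailing slash) comes back normalized.
dir_result = normalize_dir_url("http://www.example.com/foo/bar/../../baz/bux/")

# A file URL loses its last segment, because "." means "the containing directory".
file_result = normalize_dir_url("http://www.example.com/foo/bar/../baz/page.html")

print(dir_result)   # http://www.example.com/baz/bux/
print(file_result)  # http://www.example.com/foo/baz/
```

So for URLs that point at a file rather than a directory, the file name would have to be re-attached separately.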

I think invalid URLs (those with more ".." segments than there are directories to go up) can only be handled "manually", like this:

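from urllib.parse import urlparse, urlunparse  # Python 2: from urlparse import ...

def remove_extra_dots(url):
    parsed = list(urlparse(url))
    dirs = []
    for name in parsed[2].split("/"):  # parsed[2] is the path component
        if name == "..":
            # Go up one level, but never pop the leading "" (the root).
            if len(dirs) > 1:
                dirs.pop()
        else:
            dirs.append(name)
    parsed[2] = "/".join(dirs)
    return urlunparse(parsed)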

This will eliminate every ".." segment from the URL path, even the invalid ones that would climb above the root. Examples:

"http://x.com/a/b/c/../../X"  #->  http://x.com/a/X
"http://x.com/a/b/../../X"    #->  http://x.com/X
"http://x.com/../../X"        #->  http://x.com/X
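
Note that remove_extra_dots only looks at ".." and leaves single "." segments in place, so the second URL from the question (http://x.com/./../X) would still keep its "./". A possible variant that drops those as well is sketched below. The name remove_dot_segments is made up for this example, the import assumes Python 3, and the sketch makes no attempt to follow the RFC 3986 rules for trailing slashes.

```python
from urllib.parse import urlparse, urlunparse  # Python 2: from urlparse import ...


def remove_dot_segments(url):
    """Hypothetical variant of remove_extra_dots that also drops '.' segments."""
    parsed = list(urlparse(url))
    dirs = []
    for name in parsed[2].split("/"):  # parsed[2] is the path component
        if name == "..":
            # Go up one level, but keep the leading "" that marks the root.
            if len(dirs) > 1:
                dirs.pop()
        elif name != ".":
            # Keep ordinary segments; "." (current directory) is skipped.
            dirs.append(name)
    parsed[2] = "/".join(dirs)
    return urlunparse(parsed)


print(remove_dot_segments("http://x.com/./../X"))      # http://x.com/X
print(remove_dot_segments("http://x.com/a/./b/../X"))  # http://x.com/a/X
```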