Nicolae Surdu - 1 year ago 103
Python Question

Python: How to resolve URLs containing '..'

I need to uniquely identify and store some URLs. The problem is that sometimes they come containing ".." like
which basically is
if I'm not wrong.

Is there a Python function or some trick to resolve these URLs?

Answer Source

There’s a simple solution using urlparse.urljoin:

>>> import urlparse
>>> urlparse.urljoin('', '.')

However, if there is no trailing slash (the last component is a file, not a directory), the last component will be removed.
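To make the trailing-slash behaviour concrete, here is a small sketch. It assumes Python 3, where the `urlparse` module lives in `urllib.parse`, and the `example.com` URLs are made up for illustration (they are not the ones from the question):

```python
from urllib.parse import urljoin

# Base ends with a slash: the last component is treated as a directory,
# so joining with '.' only collapses the '..' segment.
with_slash = urljoin('http://www.example.com/foo/bar/../baz/', '.')

# Base ends with a file name: urljoin resolves relative to the containing
# directory, so 'file.html' is dropped from the result.
without_slash = urljoin('http://www.example.com/foo/bar/../baz/file.html', '.')

print(with_slash)
print(without_slash)
```

Both calls should print the same directory URL, which is why this shortcut is only safe when the URL is known to end in a slash.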

This fix uses the urlparse function to extract the path, then uses (the posixpath version of) os.path to normalize the components. It compensates for a mysterious issue with trailing slashes, then joins the URL back together. The following is doctestable:

import urlparse
import posixpath

def resolveComponents(url):
    """
    >>> resolveComponents('')
    >>> resolveComponents('')
    """
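    parsed = urlparse.urlparse(url)
    # normpath('') returns '.', so leave an empty path untouched
    new_path = posixpath.normpath(parsed.path) if parsed.path else ''
    if parsed.path.endswith('/') and not new_path.endswith('/'):
        # Compensate for issue1707768
        # (the second check avoids turning a bare '/' into '//')
        new_path += '/'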
    cleaned = parsed._replace(path=new_path)
    return cleaned.geturl()
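
For readers on Python 3, the same approach might look like the sketch below. It assumes `urllib.parse` in place of the Python 2 `urlparse` module; the name `resolve_components` and the `example.com` URLs are hypothetical, chosen only for illustration. It also guards the empty path and the root path '/', which `posixpath.normpath` would otherwise turn into '.' and, once the slash is re-appended, '//'.

```python
import posixpath
from urllib.parse import urlparse


def resolve_components(url):
    """Collapse '.' and '..' segments in the path of url."""
    parsed = urlparse(url)
    if not parsed.path:
        # Nothing to normalize; normpath('') would return '.'
        return url
    new_path = posixpath.normpath(parsed.path)
    # normpath drops a trailing slash, so restore it when the input had one
    # (but do not double it for the root path '/')
    if parsed.path.endswith('/') and not new_path.endswith('/'):
        new_path += '/'
    return parsed._replace(path=new_path).geturl()


print(resolve_components('http://www.example.com/foo/bar/../../baz/bux/'))
print(resolve_components('http://www.example.com/some/path/../file.ext'))
```

Unlike the `urljoin` shortcut, this keeps the final component when the URL points at a file, and the query string and fragment pass through unchanged because only the path is rewritten.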