passwd passwd - 2 months ago 7
Python Question

Match URLs by file path and GET parameters (but not their values)

How can I check if any of the URLs in my list match the given url? I need URLs to match only if all GET parameter names (not their values) and the path are the same. For example, I have this list:

links = [
"http://example.com/page.php?param1=111&param2=222",
"http://example.com/page2.php?param1=111&param2=222",
"http://example.com/page2.php?param1=111&param2=222&someParameterN=NumberN"
]

url = "http://example.com/page2.php?param1=NOT111&param2=NOT222"


This example is True because url matches links[1]. But how can I do the matching in the most efficient way? I don't know in advance what url will look like.

Answer

Ideally, use Python's urlparse module (on Python 3 the same functions live in urllib.parse). Parse your URL like so:

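import urllib.parse as urlparse  # Python 3; on Python 2 use: import urlparse

url = "http://example.com/page2.php?param1=NOT111&param2=NOT222"
parsed_url = urlparse.urlparse(url)
# Parameter names only; the values are ignored
urlparse.parse_qs(parsed_url.query).keys()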
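
As a small illustration of what that last line yields, here is a Python 3 sketch. One detail matters when only the names are compared: by default parse_qs drops parameters whose value is empty (for example "param1="), so the sketch passes keep_blank_values=True. The helper name param_names is an assumption made for this example and is not part of the original answer.

```python
from urllib.parse import urlparse, parse_qs

def param_names(url):
    """Return the set of GET parameter names in url, ignoring their values."""
    query = urlparse(url).query
    # keep_blank_values=True keeps names such as "param1=" that have no value
    return frozenset(parse_qs(query, keep_blank_values=True))

names = param_names("http://example.com/page2.php?param1=NOT111&param2=NOT222")
print(sorted(names))  # ['param1', 'param2']
```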

Then create a data structure that looks something like this:

seen_pages = set() # Stores all pages you've already seen.

Then add all your pages to it like so:

for page in list_of_pages:  # list_of_pages is your list of URLs (links in the question)
    parsed_url = urlparse.urlparse(page)
    current_page = (parsed_url.path, frozenset(urlparse.parse_qs(parsed_url.query).keys()))
    seen_pages.add(current_page)

This stores each page in the set as a tuple of the form (path, frozenset({param1, param2})).

To check whether you've already seen a page with exactly those parameter names, build the current_page tuple for it in the same way and test whether it is in the set. Lookup and insertion in a set are O(1) on average, so each check takes roughly constant time no matter how many pages are stored.
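
Putting the steps together with the data from the question, a minimal Python 3 sketch might look like this. The function name page_key is hypothetical, and keep_blank_values=True is an added assumption so that parameters with empty values still count as names.

```python
from urllib.parse import urlparse, parse_qs

def page_key(url):
    """Build a hashable key: (path, frozenset of GET parameter names)."""
    parsed = urlparse(url)
    names = parse_qs(parsed.query, keep_blank_values=True)
    return (parsed.path, frozenset(names))

links = [
    "http://example.com/page.php?param1=111&param2=222",
    "http://example.com/page2.php?param1=111&param2=222",
    "http://example.com/page2.php?param1=111&param2=222&someParameterN=NumberN",
]
url = "http://example.com/page2.php?param1=NOT111&param2=NOT222"

# Build the set once; each later membership test is O(1) on average
seen_pages = {page_key(link) for link in links}
print(page_key(url) in seen_pages)  # True: same path and names as links[1]
```

Note that this key ignores the scheme and host. If URLs on different hosts must not match, parsed.netloc could be added to the tuple.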
