I am trying to write a function as follows:
#here there should be some code that
#iterates through the urls and create
#a dictionary where the keys are the
#respective urls and their values are
#a list of the possible extentions. The
#function should return that dictionary.
In: site = 'http://example.com'
In: site = 'http://example.com/page'
I'm interpreting your question as "Given a URL, find the set of URLs that exist "below" that URL." - if that's not correct, please update your question, it's not very clear.
It is not possible to discover the entire set of valid paths on a domain, your only option would be to literally iterate over every valid character, e.g.
/aa, .... and visit each of these URLs to determine if the server returns a 200 or not. I hope it's obvious this is simply not feasible.
It is possible (though there are caveats, and the website owner may not like it / block you) to crawl a domain by visiting a predefined set of pages, scraping all the links out of the page, following those links in turn, and repeating. This is essentially what Google does. This will give you a set of "discover-able" paths on a domain, which will be more or less complete depending on how long you crawl for, and how vigorously you look for URLs in their pages. While more feasible, this will still be very slow, and will not give you "all" URLs.
What problem exactly are you trying to solve? Crawling whole websites is likely not the right way to go about it, perhaps if you explain a little more your ultimate goal, we can help identify a better course of action than what you're currently imagining.
The underlying issue is there isn't necessarily any clear meaning of an "extension" to a URL. If I run a website (whether my site lives at
http://example.com/page/ doesn't matter) I can trivially configure my server to respond successfully to any request you throw at it. It could be as simple as saying "every request to
Hello World." and all of a sudden I have an infinite number of valid pages. Web servers and URLs are similar, but fundamentally not the same as hard drives and files. Unlike a hard drive which holds a finite number of files, a website can say "yes that path exists!" to as many requests as it likes. This makes getting "all possible" URLs impossible.
Beyond that, webservers often don't want you to be able to find all valid pages - perhaps they're only accessible if you're logged in, or at certain times of day, or to requests coming from China - there's no requirement that a URL always exist, or that the webserver tell you it exists. I could very easily put my infinite-URL behavior below
http://example.com/secret/path/no/one/knows/about/.* and you'd never know it existed unless I told you about it (or you manually crawled all possible URLs...).
So the long story short is: No, it is not possible to get all URLs, or even a subset of them, because there could theoretically be an infinite number of them, and you have no way of knowing if that is the case.
if I can add restrictions, that will make it easier!
I understand why you think this, but unfortunately this is not actually true. Think about URLs like regular expressions. How many strings match the regular expression
.*? An infinite number, right? How about
/path/.*? Less? Or
/path/that/is/long/and/explicit/.*? Counter intuitive though it may seem, there are actually no fewer URLs that match the last case than the first.
Now that said, my answer up to this point has been about the general case, since that's how you posed the question. If you clearly define and restrict the search space, or loosen the requirements of the question, you can get an answer. Suppose you instead said "Is it possible to get all URLs that are listed on this page and match my filter?" then the answer is yes, absolutely. And in some cases (such as Apache's Directory Listing behavior) this will coincidentally be the same as the answer to your original question. However there is no way to guarantee this is actually true - I could perfectly easily have a directory listing with secret, unlisted URLs that still match your pattern, and you wouldn't find them.