I am starting to learn how to write a Python spider to download some pictures from the web, and I found the following code. I know some basic regex.
Alright, so to answer your first question, I'll break down [^\s]*? piece by piece.
The square brackets ([ and ]) indicate a character class. A character class matches exactly one character at that position, and that character can be any of the ones listed in the class.
For example, [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match any single character except the ones listed in it.
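As a quick check of the character class behaviour described here, a minimal Python sketch using the standard re module (the sample string is just an illustration):

```python
import re

sample = "a cab"

# [abc] matches exactly one character, which must be a, b, or c.
positive = re.findall(r"[abc]", sample)

# [^abc] is the negated class: one character that is NOT a, b, or c.
negative = re.findall(r"[^abc]", sample)

print(positive)  # ['a', 'c', 'a', 'b']
print(negative)  # [' ']
```

Each match is a single character, which is why findall returns one list entry per character rather than one per word.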
\s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.
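To see which characters \s covers, and what the negated class [^\s] leaves over, here is a small sketch (the sample string is made up for the demonstration):

```python
import re

# One space, one tab, and one newline between the letters.
sample = "a b\tc\nd"

whitespace = re.findall(r"\s", sample)
non_whitespace = re.findall(r"[^\s]", sample)

print(whitespace)      # [' ', '\t', '\n']
print(non_whitespace)  # ['a', 'b', 'c', 'd']
```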
*? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as few characters as it can, extending the match one character at a time, left to right, only when the rest of the pattern fails to match.
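The difference between the greedy * and the lazy *? is easiest to see side by side. A minimal sketch, assuming a made-up input in which the text jpg appears twice:

```python
import re

text = "foo.jpg.jpg"

# Greedy: [^\s]* grabs as much as possible, then backs off
# only far enough for the trailing "jpg" to match.
greedy = re.search(r"[^\s]*jpg", text).group()

# Lazy: [^\s]*? starts with nothing and grows one character
# at a time until the trailing "jpg" can match.
lazy = re.search(r"[^\s]*?jpg", text).group()

print(greedy)  # foo.jpg.jpg
print(lazy)    # foo.jpg
```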
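In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string, taking as few characters as possible". As mentioned in the comments, this can more succinctly be written as \S*?, since \S is the shorthand for "any non-whitespace character".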
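Two points here can be checked directly: the lazy snippet on its own is satisfied by the empty string, and [^\s] behaves the same as the shorthand \S. A short sketch, where the example.com URL is a placeholder and not taken from the question:

```python
import re

sample = "http://example.com/cat.png more text"

# With nothing after it, the lazy pattern stops immediately
# and matches the empty string.
alone = re.match(r"[^\s]*?", sample).group()

# With a suffix after it, it expands only as far as needed.
with_class = re.search(r"[^\s]*?png", sample).group()
with_shorthand = re.search(r"\S*?png", sample).group()

print(repr(alone))                   # ''
print(with_class)                    # http://example.com/cat.png
print(with_class == with_shorthand)  # True
```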
To answer the second part of your question, I'll compare the two regexes you give:
They both start the same way: by matching the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with one of the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.
Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like
Both of which are invalid. Likewise, the second pattern will permit spaces, allowing stuff like this:
http:// .jpg
http://foo bar.png
Spaces like these are equally illegal in URLs.
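To make the comparison concrete, here is a rough sketch. The patterns named first and second are hypothetical reconstructions from the description above, not necessarily the exact regexes in the question, and stricter is only one assumed way to tighten them (it is not a full URL validator). The input http:jpg and the example.com URL are illustrations of my own.

```python
import re

# Hypothetical reconstructions based on the prose description above;
# the asker's actual regexes may differ.
first = re.compile(r"http:[^\s]*?(jpg|png|gif)")
second = re.compile(r"http://.*?(jpg|png|gif)")

# Assumed stricter variant: require the slashes, forbid whitespace,
# and require a literal dot before the extension.
stricter = re.compile(r"http://\S+?\.(?:jpg|png|gif)")

candidates = [
    "http:jpg",
    "http:// .jpg",
    "http://foo bar.png",
    "http://example.com/cat.png",
]

# For each candidate, record whether each pattern matches the whole string.
results = {
    c: (
        bool(first.fullmatch(c)),
        bool(second.fullmatch(c)),
        bool(stricter.fullmatch(c)),
    )
    for c in candidates
}

for c, (m1, m2, m3) in results.items():
    print(f"{c!r}: first={m1} second={m2} stricter={m3}")
```

Under these assumptions, the first pattern accepts a string with no slashes at all, the second accepts both strings containing spaces, and the stricter variant accepts only the last candidate.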