I am new to Python. I was going through a repository on GitHub, and I saw the following line of code used to extract all URLs from a webpage:
links = re.findall('"((http|ftp)s?://.*?)"', html)
I understand regular expressions and capture groups, but I don't understand why there are extra double quotation marks enclosed within the single quotation marks. Why not just write this instead?
links = re.findall('((http|ftp)s?://.*?)', html)
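The double quotes are part of the regex. The single quotes only delimit the Python string; the double quotes inside them are literal characters that the pattern has to match. They ensure that the pattern only matches if the URL is actually surrounded by double quotes; so
foo bar http://whatever.com wouldn't match, but
<a href="http://whatever.com"> will.
The closing quote also gives the lazy .*? something to stop at. Without it, .*? matches as little as possible, which is nothing, so every match would end right after the ://.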
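To see the difference concretely, here is a small sketch that runs both patterns against the same text. The sample string is made up for illustration and is not from the repository in question.

```python
import re

# Hypothetical sample input: one bare URL and one URL inside a quoted attribute.
html = 'foo bar http://whatever.com <a href="http://whatever.com">link</a>'

# With the literal double quotes, only the URL wrapped in "..." is matched.
# findall returns one tuple per match because the pattern has two groups.
quoted = re.findall('"((http|ftp)s?://.*?)"', html)

# Without them, nothing follows the lazy .*?, so it matches the empty string
# and each match stops immediately after "://".
unquoted = re.findall('((http|ftp)s?://.*?)', html)

print(quoted)    # [('http://whatever.com', 'http')]
print(unquoted)  # [('http://', 'http'), ('http://', 'http')]
```

So the version without the quotes does find both occurrences, but it never captures anything past the scheme.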
Note that this is a really fragile way of doing things, though, since HTML attribute values can also be written in single quotes, and those URLs wouldn't match the regex.
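As a rough illustration of that fragility, the sketch below compares the regex with the standard library's html.parser on markup that mixes both quote styles. The sample markup and the LinkCollector class are made up for this example; using a parser is a different technique from the one in the repository, shown only as one possible alternative.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: one single-quoted href, one double-quoted href.
html = "<a href='http://whatever.com'>one</a> <a href=\"https://example.org\">two</a>"

# The regex only sees the double-quoted URL; m[0] is the outer capture group.
regex_links = [m[0] for m in re.findall('"((http|ftp)s?://.*?)"', html)]


class LinkCollector(HTMLParser):
    """Hypothetical helper that collects every href attribute value."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


collector = LinkCollector()
collector.feed(html)
parser_links = collector.links

print(regex_links)   # ['https://example.org']
print(parser_links)  # ['http://whatever.com', 'https://example.org']
```

The two approaches are not equivalent: the parser version here only looks at href attributes, while the regex picks up any double-quoted URL, wherever it appears in the text.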