nilanjanaLodh - 4 months ago
Python Question

Double quotes inside single quotes in a regular expression (Python)

I am new to Python. I was going through a repository on GitHub and saw the following line of code, which extracts all URLs from a web page. I understand regular expressions and capture groups, but I don't understand why there are extra double quotation marks inside the single quotation marks.

links = re.findall('"((http|ftp)s?://.*?)"', html)


That is, how is it different from the following code?

links = re.findall('((http|ftp)s?://.*?)', html)


I tried experimenting and saw that only the first one matches URLs correctly, while the second one doesn't, but I don't understand why.

Any help is appreciated.

Thank you.

Answer

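The double quotes are part of the regex. The single quotes only delimit the Python string, so the pattern itself begins and ends with a literal ". They ensure that the pattern matches only when the URL is actually surrounded by double quotes, so foo bar http://whatever.com wouldn't match, but <a href="http://whatever.com"> will. The closing quote is also what makes the lazy .*? do its job: it has to keep expanding until it reaches that quote. In the second version nothing follows .*?, so it matches as few characters as possible (zero), and every match stops right after ://.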
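
To see the difference concretely, here is a minimal sketch that runs both patterns from the question on the same input. The html string is a made-up sample, not taken from the repository in the question.

```python
import re

# Made-up sample: one bare URL in text, one URL inside a double-quoted attribute.
html = 'foo bar http://whatever.com <a href="http://whatever.com">link</a>'

# With the literal quotes, the lazy .*? must expand until the closing quote.
quoted = re.findall('"((http|ftp)s?://.*?)"', html)

# Without them, nothing follows .*?, so it matches zero characters.
unquoted = re.findall('((http|ftp)s?://.*?)', html)

print(quoted)    # [('http://whatever.com', 'http')]
print(unquoted)  # [('http://', 'http'), ('http://', 'http')]
```

Because the pattern has two capture groups, re.findall returns a tuple per match; the first element is the URL and the second is the scheme captured by (http|ftp).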

Note that this is a really fragile way of doing things, though, since single quotes are also valid around attribute values in HTML, and the regex wouldn't match those.
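
As an illustration of that fragility, the sketch below runs the original pattern on a made-up snippet that uses both quote styles, then tries a hypothetical variant (not from the repository in the question) that accepts either quote by capturing the opening quote and requiring the same character to close it.

```python
import re

# Made-up sample: one double-quoted and one single-quoted attribute value.
html = """<a href="http://a.example/x">A</a> <a href='https://b.example/y'>B</a>"""

# Original pattern: only the double-quoted URL is found.
double_only = [m[0] for m in re.findall('"((http|ftp)s?://.*?)"', html)]

# Hypothetical variant: group 1 captures the opening quote (" or '),
# and the backreference \1 requires the same character to close the value.
either_quote_re = re.compile(r'''(["'])((?:http|ftp)s?://.*?)\1''')
either_quote = [m.group(2) for m in either_quote_re.finditer(html)]

print(double_only)   # ['http://a.example/x']
print(either_quote)  # ['http://a.example/x', 'https://b.example/y']
```

Even this variant still misses unquoted attribute values and URLs that appear outside attributes, so for anything beyond a quick script a real HTML parser is likely the safer choice.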