erik7970 erik7970 - 14 days ago 6
Python Question

Using regular expressions to find URL not containing certain info

I'm working on a scraper/web crawler using Python 3.5 and the re module, where one of its functions needs to retrieve a YouTube channel's URL. I'm using the following regular expression to do this:

href = re.compile("(/user/|/channel/)(.+)")


What it should return is something like /user/username or /channel/channelname. It does this successfully for the most part, but every now and then it grabs a URL that includes more information, like /user/username/videos?view=60, or something else that continues after the username/ portion.

In an attempt to address this issue, I rewrote the bit of code above as

href = re.compile("(/user/|/channel/)(?!(videos?view=60)(.+)")


along with other variations, with no success. How can I rewrite my code so that it fetches URLs that do not include videos?view=60 anywhere in the URL?

Answer

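The greedy .+ in your pattern matches everything up to the end of the string. Use a negated character class, [^/]+, instead, so the match stops at the first slash after the user or channel name: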

import re

user_url = '/user/username/videos?view=60'
channel_url = '/channel/channelname/videos?view=60'

# [^/]+ matches one or more characters that are not a slash, so the match
# ends where the user or channel name ends.
pattern = re.compile(r'(/user/|/channel/)([^/]+)')

m = pattern.match(user_url)
print(m.group())    # /user/username

m = pattern.match(channel_url)
print(m.group())    # /channel/channelname
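
In a crawler the hrefs will probably be a mix of relative paths, full URLs and unrelated links, so here is a rough sketch of how the same pattern could be applied to a list of them. The helper name extract_channel_path and the sample hrefs are made up for illustration. It uses search() rather than match() on the assumption that some hrefs are absolute URLs, since match() only matches at the start of the string.

```python
import re

# Pattern from the answer: a /user/ or /channel/ prefix, then every
# character up to (but not including) the next slash.
CHANNEL_PATH = re.compile(r'(/user/|/channel/)([^/]+)')

def extract_channel_path(href):
    """Return the '/user/<name>' or '/channel/<name>' part of href, or None."""
    m = CHANNEL_PATH.search(href)  # search() also finds the path inside a full URL
    return m.group() if m else None

# Hypothetical hrefs, as a crawler might collect them.
hrefs = [
    '/user/username',
    '/user/username/videos?view=60',
    'https://www.youtube.com/channel/channelname/playlists',
    '/watch?v=abc123',
]

# Deduplicate while keeping first-seen order.
paths = []
for href in hrefs:
    path = extract_channel_path(href)
    if path is not None and path not in paths:
        paths.append(path)

print(paths)  # ['/user/username', '/channel/channelname']
```

Note that [^/]+ only stops at a slash. If a query string can follow the name directly (for example /user/username?view=60), a class such as [^/?#]+ would be one way to cut that off as well.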