erik7970 erik7970 - 1 year ago 73
Python Question

Using regular expressions to find URL not containing certain info

I'm working on a scraper/web crawler using Python 3.5 and the

module, where one of its functions needs to retrieve a YouTube channel's URL. I'm using the following code, which matches a regular expression, to accomplish this:

href = re.compile("(/user/|/channel/)(.+)")

What it should return is something like
. It does this successfully for the most part, but every now and then it grabs a type of URL that includes more information like
or something else that goes on after the

In an attempt to address this issue, I rewrote the bit of code above as

href = re.compile("(/user/|/channel/)(?!(videos?view=60)(.+)")

along with other variations, with no success. How can I rewrite my code so that it fetches URLs that do not include
anywhere in the URL?

Answer Source

Capture only the characters up to the next slash. The character class [^/]+ stops at the first /, so anything after the user or channel name is dropped:

import re

user_url = '/user/username/videos?view=60'
channel_url = '/channel/channelname/videos?view=60'

# Group 1 is the prefix; group 2 is the name, cut off at the next "/"
pattern = re.compile(r'(/user/|/channel/)([^/]+)')

m = re.match(pattern, user_url)
print(m.group())    # /user/username

m = re.match(pattern, channel_url)
print(m.group())    # /channel/channelname
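
If the hrefs you collect can also be absolute URLs, or can carry a query string right after the name (with no slash in between), a variation along these lines might help. It is only a sketch: the names CHANNEL_RE and channel_path and the sample hrefs are made up for illustration, and it assumes the name never contains "/", "?" or "#". It uses search() instead of match() so the pattern does not have to start at the beginning of the string.

```python
import re

# Stop at the next "/", "?" or "#", so trailing paths and query strings
# such as "/videos?view=60" are never part of the match.
CHANNEL_RE = re.compile(r'/(?:user|channel)/[^/?#]+')


def channel_path(href):
    """Return the '/user/<name>' or '/channel/<name>' part of href, or None."""
    # search() scans the whole string, so absolute URLs work too
    m = CHANNEL_RE.search(href)
    return m.group(0) if m else None


# Hypothetical hrefs of the kind a crawler might collect
hrefs = [
    '/user/username',
    '/channel/channelname/videos?view=60',
    'https://www.youtube.com/user/username/videos?view=60',
    '/watch?v=abc',
]

paths = [channel_path(h) for h in hrefs]
print(paths)
# ['/user/username', '/channel/channelname', '/user/username', None]
```

Links that are not user or channel links come back as None, so you can filter them out before storing the results.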