S. Paw - 3 months ago
Python Question

Python continue not working

I have the following python code:

if tag=='a':
    for (key, value) in attrs:
        if key=='rel':
            if value=='nofollow':
                continue
        if key=='href':
            # We are grabbing the new URL. We are also adding the
            # base URL to it.
            # We combine a relative URL with the base URL to create
            # an absolute URL.
            if not value.startswith('#'):
                newUrl = parse.urljoin(self.baseUrl, value.rstrip('/'))
                if newUrl not in self.links:
                    self.links = self.links + [newUrl]


I want nofollow links to be skipped so that they are never added to the self.links list.

However, the link is still being added and eventually ends up in my database, which I don't want.

Do I need something else instead of continue or am I just...lost?

Answer

This probably does what you want:

if tag=='a':
    # only if there's no rel="nofollow"
    if not any(key == 'rel' and value == 'nofollow' for key, value in attrs):
        for (key, value) in attrs:
            if key=='href':
                # We are grabbing the new URL. We are also adding the
                # base URL to it.
                # We combine a relative URL with the base URL to create
                # an absolute URL.
                if not value.startswith('#'):
                    newUrl = parse.urljoin(self.baseUrl, value.rstrip('/'))
                    if newUrl not in self.links:
                        self.links = self.links + [newUrl]
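
For context, here is a minimal self-contained sketch of how that check might sit inside an html.parser.HTMLParser subclass. The class name LinkParser, the base_url argument, and the sample HTML are assumptions made for illustration, not the asker's actual code. As a variation on the snippet above, it splits the rel value into tokens, so a value such as rel="nofollow noopener" is also skipped, which a plain equality test would miss.

```python
from html.parser import HTMLParser
from urllib import parse


class LinkParser(HTMLParser):
    """Hypothetical parser that collects absolute URLs from followable links."""

    def __init__(self, base_url):
        super().__init__()
        self.baseUrl = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        attr_map = dict(attrs)
        rel_tokens = (attr_map.get('rel') or '').lower().split()
        if 'nofollow' in rel_tokens:
            return  # skip the whole tag, not just one attribute
        href = attr_map.get('href')
        if href and not href.startswith('#'):
            # Combine the (possibly relative) href with the base URL.
            newUrl = parse.urljoin(self.baseUrl, href.rstrip('/'))
            if newUrl not in self.links:
                self.links.append(newUrl)


parser = LinkParser('http://example.com/')
parser.feed('<a rel="nofollow" href="/skip">x</a><a href="/keep/">y</a>')
print(parser.links)  # ['http://example.com/keep']
```

Returning early from the handler makes the decision once per tag, so the order of the rel and href attributes no longer matters.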

EDIT

Copying my explanation from a comment on why the existing code doesn't do what you want:

The continue only skips the current key/value pair, not the rest of the tag's attributes. So for something like <a rel="nofollow" href="foo">, the loop sees rel=nofollow and continues to the next pair, then sees href=foo and processes the URL anyway.
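
A small standalone reproduction of that behavior may help. The helper name collect_hrefs and the hand-written attrs list are assumptions for illustration. The list uses the same (name, value) pair shape that the original loop iterates over.

```python
def collect_hrefs(attrs):
    """Mimic the original loop: `continue` skips one pair, not the whole tag."""
    found = []
    for key, value in attrs:
        if key == 'rel':
            if value == 'nofollow':
                continue  # only moves on to the next (key, value) pair
        if key == 'href':
            found.append(value)
    return found


# Attributes of <a rel="nofollow" href="foo">
attrs = [('rel', 'nofollow'), ('href', 'foo')]
print(collect_hrefs(attrs))  # prints ['foo']: the nofollow link is still collected
```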
