SMth80 - 3 years ago
Python Question

How to prevent duplicate links from being parsed?

I've written a script in Python to scrape the next-page links from a webpage, and it's running well at the moment. The only issue is that the scraper can't shake off duplicate links. I hope somebody can help me fix this. Here's what I've tried:

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"

def nextpage_links(main_link):
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            print(item.attrib["href"])

nextpage_links(page_link)


Here is part of the output I'm getting (screenshot omitted); the same page links are printed multiple times.

Answer Source

You can use a set for this purpose, since a set stores each element only once:

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"

def nextpage_links(main_link):
    links = set()  # a set automatically discards duplicates
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        href = item.attrib["href"]
        if "page" in href and href not in links:
            links.add(href)
            print(href)  # printed only the first time the link is seen
    return links

nextpage_links(page_link)
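
As a side note (my addition, not part of the original answer): a plain set does not preserve the order in which links appear on the page. If order matters, one minimal sketch is to collect the hrefs into a list and deduplicate with dict.fromkeys, which keeps first-seen order on Python 3.7+. The function name nextpage_links_ordered is hypothetical:

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"

def nextpage_links_ordered(main_link):
    # Hypothetical variant: same scraping logic as above, but
    # deduplicates while preserving the order links appear on the page.
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    hrefs = [item.attrib["href"]
             for item in tree.cssselect('ul.tsc_pagination a')
             if "page" in item.attrib["href"]]
    # dict.fromkeys drops duplicates and keeps first-seen order (Python 3.7+)
    return list(dict.fromkeys(hrefs))

print(nextpage_links_ordered(page_link))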

You can also use Scrapy, which filters out duplicate requests by default via its built-in duplicate-request filter.
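
For illustration, here is a minimal sketch of what that might look like (the spider name is my assumption; only the start URL and selector come from the question). Scrapy's scheduler silently drops requests for URLs it has already seen:

import scrapy

class PaginationSpider(scrapy.Spider):
    # Hypothetical spider name; the start URL is from the question.
    name = "pagination"
    start_urls = ["https://yts.ag/browse-movies"]

    def parse(self, response):
        for href in response.css('ul.tsc_pagination a::attr(href)').getall():
            if "page" in href:
                # response.follow builds an absolute Request; Scrapy's
                # default dupefilter skips URLs that were already scheduled.
                yield response.follow(href, callback=self.parse)

You can run a standalone spider like this with scrapy runspider, without creating a full Scrapy project.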
