sboss sboss - 1 month ago 16
Python Question

Scrapy: only follow internal URLs but extract all links found

I want to get all external links from a given website using Scrapy. With the following code, the spider crawls external links as well:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    rules = (Rule(LinkExtractor(), callback="parse_obj", follow=True),)

    def parse_obj(self, response):
        item = someItem()
        item['url'] = response.url
        return item


What am I missing? Doesn't "allowed_domains" prevent external links from being crawled? If I set "allow_domains" on the LinkExtractor, it does not extract the external links. Just to clarify: I want to crawl internal links but extract external links. Any help appreciated!

Answer

You can also run a link extractor inside the callback to pull the links out of each page as it is parsed.

The link extractor will filter the links for you. In this example, the extractor used in the callback denies links in the allowed domains, so it returns only external links, while the crawl rule keeps following links from page to page.

# Import paths for current Scrapy releases (the scrapy.contrib.* modules are deprecated).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from myproject.items import someItem

class someSpider(CrawlSpider):
    name = 'crawltest'
    allowed_domains = ['someurl.com']
    start_urls = ['http://www.someurl.com/']

    # Follow every extracted link; requests outside allowed_domains are dropped as offsite.
    rules = (Rule(LinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        # Extract only the links that point outside the allowed domains.
        extractor = LinkExtractor(allow=(), deny_domains=self.allowed_domains)
        for link in extractor.extract_links(response):
            item = someItem()
            item['url'] = link.url
            yield item
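
The idea of "deny links in the allowed domain" comes down to a host comparison, which can be sketched with the standard library alone. This is only an illustration under stated assumptions, not Scrapy's actual implementation: the helper names is_internal and external_links are hypothetical, and the sample URLs are made up around the someurl.com placeholder from the question.

```python
from urllib.parse import urlparse


def is_internal(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def external_links(urls, allowed_domains):
    """Keep only the URLs that point outside the allowed domains."""
    return [u for u in urls if not is_internal(u, allowed_domains)]


# Hypothetical links, as they might be found on one crawled page.
links = [
    "http://www.someurl.com/about",
    "http://blog.someurl.com/post/1",
    "http://example.org/page",
    "http://notsomeurl.com/",
]

print(external_links(links, ["someurl.com"]))
```

Comparing whole host names, rather than searching for the domain as a substring of the URL, keeps a look-alike host such as notsomeurl.com on the external side.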