I need to use Scrapy to crawl all internal links of a site, so that, for instance, every link on www.stackoverflow.com is crawled. This code sort of works:
extractor = LinkExtractor(allow_domains=self.getBase(self.startDomain))
for link in extractor.extract_links(response):
    yield scrapy.Request(link.url, callback=self.parse)
If I understand the question correctly, you want to use scrapy.spidermiddlewares.offsite.OffsiteMiddleware (https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware):
Filters out Requests for URLs outside the domains covered by the spider.
This middleware filters out every request whose host names aren’t in the spider’s allowed_domains attribute. All subdomains of any
domain in the list are also allowed. E.g. the rule www.example.org will also allow bob.www.example.org but not www2.example.com nor example.com.
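The subdomain rule quoted above can be sketched with the standard library. This is an illustration of the matching behaviour described, not OffsiteMiddleware's actual implementation, and `is_onsite` is a hypothetical helper:

```python
from urllib.parse import urlparse

def is_onsite(url, allowed_domains):
    """Keep a URL only if its host equals an allowed domain
    or is a subdomain of one (host ends with "." + domain)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

allowed = ["www.example.org"]
print(is_onsite("http://bob.www.example.org/page", allowed))  # True
print(is_onsite("http://www2.example.com/page", allowed))     # False
print(is_onsite("http://example.com/page", allowed))          # False
```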
When your spider returns a request for a domain not belonging to those covered by the spider, this middleware will log a debug message
similar to this one:
DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>

To avoid filling the log with too much noise, it will only print one of these messages for each new domain filtered. So, for example, if another request for www.othersite.com is filtered, no log message will be printed. But if a request for someothersite.com is filtered, a message will be printed (but only for the first request filtered).
If the spider doesn’t define an allowed_domains attribute, or the attribute is empty, the offsite middleware will allow all requests. If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.
My understanding is that the URLs are normalised before being filtered.