Lars Nielsen - 28 days ago
Python Question

Scrapy crawl only internal links, including relative links

I need to use Scrapy to crawl all internal links of a site, so that, for instance, every link on www.stackoverflow.com is crawled. This code sort of works:

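from scrapy.linkextractors import LinkExtractor

# Inside a spider callback that receives `response`:
extractor = LinkExtractor(allow_domains=self.getBase(self.startDomain))

for link in extractor.extract_links(response):
    self.registerUrl(link.url)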


However, there is a small problem: relative paths such as /meta or /questions/ask are not crawled, as they do not contain the base domain stackoverflow.com. Any ideas how to fix this?

Answer

If I understand the question correctly, you want to use scrapy.spidermiddlewares.offsite.OffsiteMiddleware (https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware). Its documentation reads:

Filters out Requests for URLs outside the domains covered by the spider.

This middleware filters out every request whose host names aren’t in the spider’s allowed_domains attribute. All subdomains of any domain in the list are also allowed. E.g. the rule www.example.org will also allow bob.www.example.org but not www2.example.com nor example.com.

When your spider returns a request for a domain not belonging to those covered by the spider, this middleware will log a debug message similar to this one:

DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>

To avoid filling the log with too much noise, it will only print one of these messages for each new domain filtered. So, for example, if another request for www.othersite.com is filtered, no log message will be printed. But if a request for someothersite.com is filtered, a message will be printed (but only for the first request filtered).

If the spider doesn’t define an allowed_domains attribute, or the attribute is empty, the offsite middleware will allow all requests.

If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.
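The host-matching rule quoted above can be sketched in plain Python. This is only an illustration of the rule as the documentation words it, not Scrapy's actual implementation; the function name is_onsite and the example URLs are made up for this sketch.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Hypothetical helper: it mirrors the documented rule, not Scrapy's code.
    """
    if not allowed_domains:
        # An empty or missing list means nothing is filtered.
        return True
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["www.example.org"]
print(is_onsite("http://bob.www.example.org/page", allowed))  # True: subdomain
print(is_onsite("http://www2.example.com/page", allowed))     # False: other domain
print(is_onsite("http://example.com/page", allowed))          # False: other domain
print(is_onsite("http://anything.test/page", []))             # True: empty list
```

In a real spider you would not write this check yourself: setting allowed_domains on the spider class should be enough for the middleware to apply it.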

My understanding is that the URLs are normalised before being filtered.
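To illustrate what normalising means here: a relative link such as /meta has no domain of its own until it is resolved against the URL of the page it was found on. To my knowledge, LinkExtractor returns links that are already resolved this way, and Scrapy responses offer response.urljoin for doing it by hand. The standard-library sketch below shows the same resolution; the page URL and the list of hrefs are assumptions chosen for the example.

```python
from urllib.parse import urljoin

# Assumed URL of the page the links were found on.
page_url = "https://stackoverflow.com/questions"

# Two relative links from the question plus one absolute external link.
hrefs = ["/meta", "/questions/ask", "https://example.com/elsewhere"]

# Resolve every href against the page URL; absolute URLs pass through unchanged.
resolved = [urljoin(page_url, href) for href in hrefs]

for url in resolved:
    print(url)
# https://stackoverflow.com/meta
# https://stackoverflow.com/questions/ask
# https://example.com/elsewhere
```

Once resolved, the first two URLs carry the stackoverflow.com host and can be matched against allowed_domains like any other link, while the third would be filtered as offsite.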
