Lars Nielsen - 1 year ago
Python Question

Scrapy crawl only internal links, including relative links

I need to use Scrapy to crawl all internal links of a site, so that every link on a given domain is crawled. This code sort of works:

extractor = LinkExtractor(allow_domains=self.getBase(self.startDomain))

for link in extractor.extract_links(response):

However, there is a small problem: relative paths are not crawled, because they do not contain the base domain. Any ideas how to fix this?

Answer Source

If I understand the question correctly, you want to use scrapy.spidermiddlewares.offsite.OffsiteMiddleware. From the documentation:

Filters out Requests for URLs outside the domains covered by the spider.

This middleware filters out every request whose host names aren’t in the spider’s allowed_domains attribute. All subdomains of any domain in the list are also allowed. E.g. a rule for a given host will also allow that host’s subdomains, but not hosts on unrelated domains.

When your spider returns a request for a domain not belonging to those covered by the spider, this middleware will log a debug message similar to this one:

DEBUG: Filtered offsite request to '': <GET>

To avoid filling the log with too much noise, it will only print one of these messages for each new domain filtered. So, for example, if another request for the same domain is filtered, no log message will be printed. But if a request for a different domain is filtered, a message will be printed (but only for the first request filtered).

If the spider doesn’t define an allowed_domains attribute, or the attribute is empty, the offsite middleware will allow all requests.

If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.
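
The rules quoted above can be sketched in plain Python. This is only an illustration of the described behaviour, not Scrapy's actual implementation: the class OffsiteFilterSketch, its should_follow method and the example.org / other.test hosts are all hypothetical names chosen for the example.

```python
from urllib.parse import urlparse


def host_allowed(host, allowed_domains):
    """True if host equals an allowed domain or is a subdomain of one."""
    host = host.lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


class OffsiteFilterSketch:
    """Hypothetical stand-in for the behaviour described above, not Scrapy code."""

    def __init__(self, allowed_domains=None):
        self.allowed_domains = list(allowed_domains or [])
        self.domains_seen = set()  # offsite hosts already reported
        self.messages = []         # collected here instead of real logging

    def should_follow(self, url, dont_filter=False):
        # An empty allowed_domains list, or dont_filter, lets everything through.
        if dont_filter or not self.allowed_domains:
            return True
        host = urlparse(url).hostname or ""
        if host_allowed(host, self.allowed_domains):
            return True
        # Report each filtered host only once to keep the log quiet.
        if host not in self.domains_seen:
            self.domains_seen.add(host)
            self.messages.append(f"Filtered offsite request to {host!r}")
        return False


offsite = OffsiteFilterSketch(["example.org"])
print(offsite.should_follow("http://blog.example.org/post"))  # subdomain: True
print(offsite.should_follow("http://other.test/a"))           # offsite: False
print(offsite.should_follow("http://other.test/b"))           # offsite: False
print(offsite.messages)  # only one message for other.test
```

In a real spider, the equivalent is presumably just setting allowed_domains on the spider class and letting the middleware do this filtering.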

My understanding is that the URLs are normalised before being filtered.
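
One way to read "normalised" here is that relative links are resolved against the URL of the page they came from before any domain check happens, so a relative path ends up carrying the page's host. A minimal sketch using only the standard library, assuming a hypothetical page URL and hrefs (internal_links is an illustrative helper, not a Scrapy function):

```python
from urllib.parse import urljoin, urlparse


def internal_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep those on the same host."""
    base_host = urlparse(page_url).hostname
    # urljoin turns "/about" or "guide.html" into absolute URLs.
    resolved = (urljoin(page_url, href) for href in hrefs)
    return [url for url in resolved if urlparse(url).hostname == base_host]


links = internal_links(
    "http://example.org/docs/index.html",
    ["/about", "guide.html", "http://other.test/ad"],
)
print(links)
# ['http://example.org/about', 'http://example.org/docs/guide.html']
```

Inside a Scrapy callback, response.urljoin(href) performs the same kind of resolution for hrefs extracted by hand.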
