Nathaniel Ford - 3 months ago
Python Question

What precisely does DEPTH_LIMIT refer to? Is the current depth referencable?

Scrapy indicates it has a DEPTH_LIMIT setting, but doesn't specifically say what it considers 'depth'. In terms of scraping pages, I've seen 'depth' refer to 'depth of the url': for http://somedomain.com/this/is/a/depth/six/url, the page requested by that URL has a depth of six, because it's six segments in, and http://somedomain.com would be depth zero.

On the other hand, when we consider scraping in terms of trees, depth would more likely refer to how far you are from the starting location. Thus, if I feed it a starting url of http://somedomain.com/start/here, that is depth zero, and any link found on that response would be depth one.

Does Scrapy use one of these definitions? If so, which one? If it is the latter (which seems more logical), is there any way to get that depth information, either when processing the response in the crawler or when post-processing it as an item in the pipeline?

Answer

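Scrapy uses the second definition: depth is the number of link hops from the start URLs, so a start URL is depth zero and any link found on its response is depth one. By default Scrapy traverses in depth-first order, and the current depth can be accessed via the response metadata: response.meta['depth'].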