PHA PHA - 1 year ago 84
Python Question

How many items have been scraped per start_url

I use Scrapy to crawl 1000 URLs and store the scraped items in MongoDB. I'd like to know how many items have been found for each URL. From the Scrapy stats I can see

'item_scraped_count': 3500

However, I need this count for each start_url separately. There is also
a field on each item that I might use to count each URL's items manually:

2016-05-24 15:15:10 [scrapy] DEBUG: Crawled (200) <GET> (referer:

But I wonder whether Scrapy has built-in support for this.

Answer Source

challenge accepted!

Scrapy has nothing built in that directly supports this, but you can keep the counting out of your spider code by using a Spider Middleware:

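from scrapy.http.request import Request

class StartRequestsCountMiddleware(object):

    start_urls = {}  # start request index -> start URL

    def process_start_requests(self, start_requests, spider):
        for i, request in enumerate(start_requests):
            self.start_urls[i] = request.url
            # Tag each start request so its responses can be traced back to it.
            request.meta['start_request_index'] = i
            yield request

    def process_spider_output(self, response, result, spider):
        index = response.meta['start_request_index']
        for output in result:
            if isinstance(output, Request):
                # Follow-up requests inherit the index of their start request.
                output.meta['start_request_index'] = index
            else:
                # Anything else is treated as an item: count it under its start URL.
                spider.crawler.stats.inc_value(
                    'start_requests/item_scraped_count/{}'.format(self.start_urls[index]))
            yield output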

Remember to activate it in SPIDER_MIDDLEWARES in your project settings:

SPIDER_MIDDLEWARES = {
    'myproject.middlewares.StartRequestsCountMiddleware': 200,
}

Now you should be able to see something like this in your spider stats:

'start_requests/item_scraped_count/START_URL1': ITEMCOUNT1,
'start_requests/item_scraped_count/START_URL2': ITEMCOUNT2,
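
If you want the counts as a plain mapping rather than reading them off the stats dump, a minimal sketch is to filter the stats dict by the key prefix used above. The helper name `items_per_start_url` and the example URLs are made up for illustration. In a real crawl the dict would come from something like `crawler.stats.get_stats()`:

```python
PREFIX = 'start_requests/item_scraped_count/'


def items_per_start_url(stats):
    """Return {start_url: item_count} from a Scrapy-style stats dict."""
    return {
        key[len(PREFIX):]: value
        for key, value in stats.items()
        if key.startswith(PREFIX)
    }


# Hypothetical stats, shaped like the dict Scrapy reports at the end of a crawl.
example_stats = {
    'item_scraped_count': 19,
    'start_requests/item_scraped_count/http://example.com/a': 12,
    'start_requests/item_scraped_count/http://example.com/b': 7,
}

counts = items_per_start_url(example_stats)
for url, count in sorted(counts.items()):
    print(url, count)
```

Note that a start URL that produced no items never gets a stats key, so it will be missing from the mapping rather than showing up with a count of 0.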