Jake Jake - 3 months ago

How to add try/except handling in a Scrapy spider?

I built a simple crawler application using urllib2 and BeautifulSoup, and now I am planning to convert it into a Scrapy spider, but how can I handle errors while the crawler is running?
My current application has some code like this:

error_file = open('errors.txt', 'a')
finish_file = open('finishlink.txt', 'a')
try:
    # code that processes each link goes here; process_link() stands in
    # for my actual processing code
    process_link(link)
    # links that finish successfully are recorded in 'finishlink.txt'
    finish_file.write(link + '\n')
except Exception as e:
    # failed links are recorded in 'errors.txt' along with the error
    error_file.write('%s %s\n' % (link, e))


So when I process thousands of links, the successfully processed ones are stored in finishlink.txt and the failures end up in errors.txt, so that I can re-run the links from errors.txt later until they are processed successfully.
How can I accomplish the same thing with this code?

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename + '.txt', 'wb') as f:
            f.write(response.body)

Answer

You can create a spider middleware and override its process_spider_exception() method, saving the failed links to a file there.

A spider middleware is just a way for you to extend Scrapy's behavior. Here is a full example that you can modify as needed for your purpose:

from scrapy import signals


class SaveErrorsMiddleware(object):
    def __init__(self, crawler):
        # open/close the output file when the spider starts and finishes
        crawler.signals.connect(self.close_spider, signals.spider_closed)
        crawler.signals.connect(self.open_spider, signals.spider_opened)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def open_spider(self, spider):
        self.output_file = open('somefile.txt', 'a')

    def close_spider(self, spider):
        self.output_file.close()

    def process_spider_exception(self, response, exception, spider):
        # called whenever a spider callback raises an exception
        self.output_file.write(response.url + '\n')

Put this in a module and set it up in settings.py:

SPIDER_MIDDLEWARES = {
    'myproject.middleware.SaveErrorsMiddleware': 1000,
}

This code will run together with your spider, triggering the open_spider(), close_spider(), and process_spider_exception() methods when appropriate.
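
The example above only records the failing links. As a minimal sketch of how the same approach could also capture your successfully finished links, you could additionally implement process_spider_output(), which Scrapy calls with the output of every callback that returns without raising (the SaveLinksMiddleware name and the file handling below are illustrative, reusing your errors.txt/finishlink.txt file names):

from scrapy import signals


class SaveLinksMiddleware(object):
    def __init__(self, crawler):
        crawler.signals.connect(self.open_spider, signals.spider_opened)
        crawler.signals.connect(self.close_spider, signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def open_spider(self, spider):
        self.finish_file = open('finishlink.txt', 'a')
        self.error_file = open('errors.txt', 'a')

    def close_spider(self, spider):
        self.finish_file.close()
        self.error_file.close()

    def process_spider_output(self, response, result, spider):
        # called when a callback returns without raising; log the URL as
        # finished, then pass the items/requests through unchanged
        self.finish_file.write(response.url + '\n')
        for item_or_request in result:
            yield item_or_request

    def process_spider_exception(self, response, exception, spider):
        # called when a callback raises; log the URL together with the error
        self.error_file.write('%s %r\n' % (response.url, exception))

Register it in SPIDER_MIDDLEWARES exactly like the example above.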

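Since you want to re-run the failed links later until they succeed, one way to do that is a small spider that feeds errors.txt back in as requests. This is a rough sketch assuming one failed URL at the start of each line (RetrySpider is a made-up name, and parse() just reuses your original callback):

import scrapy


class RetrySpider(scrapy.Spider):
    name = "retry"

    def start_requests(self):
        # re-queue every previously failed link, one URL per line
        with open('errors.txt') as f:
            for line in f:
                url = line.strip().split(' ')[0]  # the URL is the first field
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename + '.txt', 'wb') as f:
            f.write(response.body)
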
Read more: Spider Middleware - http://doc.scrapy.org/en/latest/topics/spider-middleware.html
