sebsasto - 22 days ago

How to make a scrapy spider run multiple times from a tornado request

I have a Scrapy spider that I need to run whenever a Tornado GET request comes in. The first time I call the Tornado endpoint, the spider runs fine, but on every subsequent request the spider does not run and the following error is raised:

Traceback (most recent call last):
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/tornado/web.py", line 1413, in _execute
    result = method(*self.path_args, **self.path_kwargs)
  File "server.py", line 38, in get
    process.start()
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable


The Tornado handler is:

class PageHandler(tornado.web.RequestHandler):

    def get(self):
        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
        })

        process.crawl(YourSpider)
        process.start()

        self.write(json.dumps(results))


So the idea is that every time the handler's get method is called, the spider runs and performs the crawl.

Thanks for your help!

Answer

Well, after a lot of googling I finally found the answer to this problem. There is a library, scrapydo (https://github.com/darkrho/scrapydo), based on crochet, which blocks on the reactor for you and lets you run the same spider again every time it is needed.
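(For context: crochet runs the Twisted reactor once in a dedicated thread and lets ordinary blocking code wait on Deferreds, which is why the reactor never has to be restarted. A rough sketch of that underlying pattern without scrapydo, assuming crochet plus Scrapy's CrawlerRunner, could look like the code below; run_my_spider and the 60-second timeout are just illustrations, not part of the original answer.)

import crochet
crochet.setup()  # start the Twisted reactor in a background thread, once per process

from scrapy.crawler import CrawlerRunner

# CrawlerRunner schedules crawls on an already-running reactor, unlike
# CrawlerProcess, which tries to start (and later stop) the reactor itself.
runner = CrawlerRunner({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1},
})

@crochet.wait_for(timeout=60.0)  # block the calling thread until the crawl's Deferred fires
def run_my_spider():
    # YourSpider is the spider class from the question
    return runner.crawl(YourSpider)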

So, to solve the problem, install the library, call the setup method once, and then use the run_spider method. The code looks like this:

import scrapydo
scrapydo.setup()


class PageHandler(tornado.web.RequestHandler):

    def get(self):
        scrapydo.run_spider(YourSpider(), settings={
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
        })

        self.write(json.dumps(results))
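
One thing the snippet above leaves out is where results comes from (it is assumed to be filled in by ResultsPipeline, which isn't shown). According to the scrapydo README, run_spider also returns the items captured during the crawl, so a self-contained variant of the handler could look roughly like this; spider_results is my own name and YourSpider is the spider class from the question:

import json

import scrapydo
import tornado.web

scrapydo.setup()


class PageHandler(tornado.web.RequestHandler):

    def get(self):
        # run_spider blocks until the crawl finishes and returns the scraped items
        spider_results = scrapydo.run_spider(YourSpider, settings={
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
        })
        self.write(json.dumps([dict(item) for item in spider_results]))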

Hope this helps anyone who has the same problem!