Arion_Miles - 6 months ago
Python Question

Twisted Reactor not restarting in scrapy

I'm trying to run a scrapy spider via a Telegram bot using the python-telegram-bot API wrapper. Using the code below, I can successfully execute the spider and forward the scraped results to the bot, but only ONCE after starting the script. When I attempt to re-execute the spider via the bot (with a Telegram bot command), I get the error twisted.internet.error.ReactorNotRestartable.

from twisted.internet import reactor
from scrapy import cmdline
from telegram.ext import Updater, CommandHandler, MessageHandler, Filters, RegexHandler
import logging
import os
import ConfigParser
import json
import textwrap
from MIS.spiders.moodle_spider import MySpider
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerRunner, CrawlerProcess
from scrapy.utils.log import configure_logging

# Read settings from config file
config = ConfigParser.RawConfigParser()
config.read('./spiders/creds.ini')
TOKEN = config.get('BOT', 'TOKEN')
#APP_NAME = config.get('BOT', 'APP_NAME')
#PORT = int(os.environ.get('PORT', '5000'))
updater = Updater(TOKEN)

# Setting Webhook
# port=PORT,
# url_path=TOKEN) + TOKEN)

logging.basicConfig(format='%(asctime)s -# %(name)s - %(levelname)s - %(message)s',level=logging.INFO)

dispatcher = updater.dispatcher

# Real stuff

def doesntRun(bot, update):
    #process = CrawlerProcess(get_project_settings())

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner({
        'FEED_FORMAT' : 'json',
        'FEED_URI' : 'output.json'
    })

    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run() # the script will block here until the crawling is finished

    # Read the scraped items back from the feed file
    with open("./output.json", 'r') as file:
        contents = file.read()
    a_r = json.loads(contents)
    AM = a_r[0]['AM']

    message_template = textwrap.dedent("""
        AM: {AM}
        """)
    messageContent = message_template.format(AM=AM, ...)
    #print messageContent
    bot.sendMessage(chat_id=update.message.chat_id, text=messageContent)

# Handlers
test_handler = CommandHandler('doesntRun', doesntRun)

# Dispatchers
dispatcher.add_handler(test_handler)

updater.start_polling()


I'm using the code from the Scrapy docs, which goes like this:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

Answer Source

Okay, I finally solved my problem.

The python-telegram-bot API wrapper offers an easy way to restart the bot.

I simply put the line:

os.execl(sys.executable, sys.executable, *sys.argv)

at the end of the doesntRun() function (this needs import sys at the top of the script). Now whenever I call the function via the bot, it scrapes the page, stores the results, forwards the result, then restarts itself. Because every restart is a fresh process with a fresh Twisted reactor, I can execute the spider as many times as I want.
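
The restart step can be sketched as a small helper. This is only an illustration of the os.execl call above, not part of python-telegram-bot or Scrapy: the names restart_command and restart_bot are hypothetical, and it assumes the script was started as "python script.py ..." so that sys.executable and sys.argv describe how to launch it again.

```python
import os
import sys


def restart_command(executable=None, argv=None):
    """Build the argument list handed to os.execl for a self-restart.

    The first item is the program to execute; the remaining items become
    the new process's argv, which by convention starts with the program
    name again.
    """
    executable = executable or sys.executable
    argv = list(sys.argv if argv is None else argv)
    return [executable, executable] + argv


def restart_bot():
    """Replace the current process with a fresh copy of this script.

    os.execl does not return on success, so nothing after this call runs.
    """
    sys.stdout.flush()  # buffered output would otherwise be lost
    os.execl(*restart_command())
```

Calling restart_bot() as the last statement of doesntRun() matches the placement described above. Anything the handler still has to do, such as sending the message, must happen before the call, because the running process is replaced at that point.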
