Jimbo Jimbo - 1 month ago 9x
Python Question

Scrapy export csv without specifying in cmd

I understand how to export my scraped data to CSV from the command line with

scrapy crawl <spider_name> -o filename.csv

However, I'd like to run my spider from a script and write to CSV automatically (so I can use schedule to run the spider at particular times). How could I implement this in my code, and where would it go? That is, assuming it can be done, would it go in a pipeline or in the spider itself?


Scrapy uses item pipelines to post-process the data you have scraped. You can create a file called pipelines.py containing the following code, which exports your data into a folder named exports. Here's some code that I use in one of my projects:

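from scrapy import signals
# scrapy.contrib.exporter is the pre-1.0 import path; current Scrapy exposes the exporters in scrapy.exporters.
from scrapy.exporters import CsvItemExporter, JsonItemExporter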

class ExportData(object):
    def __init__(self):
        self.files = {}
        self.exporter = None

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        raise NotImplementedError

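    def spider_closed(self, spider):
        # Let the exporter write any trailing output, then close the file.
        self.exporter.finish_exporting()
        file_to_save = self.files.pop(spider)
        file_to_save.close()

    def process_item(self, item, spider):
        # Write the item through the exporter, then pass it on to later pipelines.
        self.exporter.export_item(item)
        return item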

class ExportJSON(ExportData):
    """Export items to exports/<spider-name>.json."""

    def spider_opened(self, spider):
        # The exports/ directory must already exist.
        file_to_save = open('exports/%s.json' % spider.name, 'w+b')
        self.files[spider] = file_to_save
        self.exporter = JsonItemExporter(file_to_save)
        self.exporter.start_exporting()

class ExportCSV(ExportData):
    """Export items to exports/<spider-name>.csv."""

    def spider_opened(self, spider):
        # The exports/ directory must already exist.
        file_to_save = open('exports/%s.csv' % spider.name, 'w+b')
        self.files[spider] = file_to_save
        self.exporter = CsvItemExporter(file_to_save)
        self.exporter.start_exporting()

You can view the project code on GitHub. To enable one of these pipelines, add its class path to ITEM_PIPELINES in your Scrapy settings.
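
To tie this back to running the spider from a script, here is a minimal sketch of how the settings and the launcher could look. The module name myproject, the helper names build_settings and run_spider, and the priority value 300 are all assumptions for illustration and are not part of the project above; adjust them to your own layout.

```python
def build_settings(project_module="myproject", fmt="csv"):
    """Return Scrapy settings that enable one of the export pipelines.

    project_module is a hypothetical package name; it should be the package
    that contains your pipelines.py.
    """
    pipeline = {"csv": "ExportCSV", "json": "ExportJSON"}[fmt]
    return {
        # The number is the pipeline's order (lower runs first); 300 is arbitrary.
        "ITEM_PIPELINES": {f"{project_module}.pipelines.{pipeline}": 300},
    }


def run_spider(spider_cls, settings):
    """Run a spider in-process; blocks until the crawl has finished."""
    # Imported lazily so build_settings() can be used without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=settings)
    process.crawl(spider_cls)
    process.start()
```

With this in place, a script would call something like run_spider(MySpider, build_settings("myproject")), where MySpider is your own spider class. Note that process.start() can only be called once per Python process, because the Twisted reactor cannot be restarted, so a scheduler would need to launch each run in a fresh process.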