Geo - 2 months ago
Python Question

Is it possible to run just through the item pipeline without crawling with Scrapy?

I have a .jl file with items I've scraped. I now have another pipeline that was not present when I did the scraping. Is it possible to run just the new pipeline against those items, without doing the crawl/scrape again?

Answer

Quick answer: Yes.

To bypass the downloader while keeping the other Scrapy components working, you can use a custom downloader middleware that returns a Response object from its process_request method. See the documentation for details: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html
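As a rough illustration of that approach, here is a minimal sketch of such a middleware. The class name, the load_local_body helper, and the use of request.meta['local_path'] to carry the file path are all assumptions for this example, not part of any Scrapy API; the key point is that returning a Response from process_request makes Scrapy skip the actual download.

```python
def load_local_body(path):
    # Read the raw bytes that will serve as the fake response body.
    with open(path, 'rb') as f:
        return f.read()


class LocalFileMiddleware(object):
    """Hypothetical downloader middleware that short-circuits the download.

    When process_request returns a Response, Scrapy never contacts the
    network; the response goes straight on to the spider callbacks and,
    from there, yielded items flow through the item pipelines as usual.
    """

    def process_request(self, request, spider):
        # Imported here so load_local_body stays usable without Scrapy.
        from scrapy.http import TextResponse
        body = load_local_body(request.meta['local_path'])
        return TextResponse(url=request.url, body=body, encoding='utf-8')
```

You would enable it via the DOWNLOADER_MIDDLEWARES setting, but for replaying a single local file the simpler spider below is usually enough.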

But in your case I personally think you could use some simple code to read the .jl file from your local file system. A quick (and full) example:

# coding: utf8

import json
import scrapy


class SampleSpider(scrapy.Spider):

    name = 'sample_spider'
    # A file:// URL makes Scrapy read from the local filesystem
    # instead of the network.
    start_urls = [
        'file:///tmp/some_file.jl',
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            # Dotted path to your new pipeline class.
            'your_pipeline_here': 100,
        },
    }

    def parse(self, response):
        # Each line of a .jl (JSON Lines) file is one serialized item.
        # Use response.text so we split decoded text, not raw bytes.
        for line in response.text.splitlines():
            yield json.loads(line)

Just replace '/tmp/some_file.jl' with the actual path to your file.
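For completeness, 'your_pipeline_here' would point at something like the sketch below. The class name and the field it normalizes are placeholders invented for this example; the only contract Scrapy requires is a process_item(self, item, spider) method that returns the item (or raises DropItem).

```python
class NormalizeTitlePipeline(object):
    """Hypothetical pipeline applied to the replayed items."""

    def process_item(self, item, spider):
        # Example transformation: tidy up a text field.
        if 'title' in item:
            item['title'] = item['title'].strip().title()
        return item
```

You would then register it by its dotted path, e.g. 'myproject.pipelines.NormalizeTitlePipeline', in the ITEM_PIPELINES dict shown above.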