
Scrapy - iterate over object

This is how I'm running Scrapy from a Python script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def iterate():
    process = CrawlerProcess(get_project_settings())
    tracks = process.crawl('pitchfork_tracks', domain='pitchfork.com')
    process.start()


However, I can't seem to iterate through the response, which is a dict in this fashion:

{'track': [u'\u201cAnxiety\u201d',
u'\u201cLockjaw\u201d [ft. Kodak Black]',
u'\u201cMelanin Drop\u201d',
u'\u201cDreams\u201d',
u'\u201cIntern\u201d',
u'\u201cYou Don\u2019t Think You Like People Like Me\u201d',
u'\u201cFirst Day Out tha Feds\u201d',
u'\u201cFemale Vampire\u201d',
u'\u201cGirlfriend\u201d',
u'\u201cOpposite House\u201d',
u'\u201cGirls @\u201d [ft. Chance the Rapper]',
u'\u201cI Am a Nightmare\u201d']}


How do I iterate through this response? To my knowledge, up to this point the response is an object and thus non-iterable.

Answer

You should follow the workflow of the Scrapy framework: the Spider handles how requests are built and how responses are parsed, and an ItemPipeline handles how the scraped items are processed.
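For example, a minimal sketch of such a spider might look like the following (the start URL path and the CSS selector are hypothetical; use whatever matches the real Pitchfork markup):

import scrapy

class PitchforkTracksSpider(scrapy.Spider):
    name = 'pitchfork_tracks'

    def __init__(self, domain=None, *args, **kwargs):
        super(PitchforkTracksSpider, self).__init__(*args, **kwargs)
        # build the start URL from the domain argument passed to process.crawl()
        self.start_urls = ['http://%s/reviews/tracks/' % domain]

    def parse(self, response):
        # this is where the iteration over the response belongs:
        # yield one item per track instead of one dict holding every track
        for title in response.css('h2.title::text').extract():
            yield {'track': title}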

From your code:

tracks = process.crawl('pitchfork_tracks', domain='pitchfork.com')

pitchfork_tracks is the name of a spider in your project, so the place to iterate over the response is that spider, and any further processing belongs in an ItemPipeline. Because you are running Scrapy from a script instead of the scrapy command, you have to enable the pipeline yourself in the settings you pass to CrawlerProcess. Check the docs on the common practice of running Scrapy from a script (Run Scrapy from a script).
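As a sketch (the module path and class name below are made up), such a pipeline and the extra settings needed when starting the crawl from a script could look like this:

# pipelines.py (hypothetical module in your project)
class TrackPipeline(object):
    def process_item(self, item, spider):
        # do your "further operation" on each scraped track here
        print(item['track'])
        return item

# in the script that starts the crawl
settings = get_project_settings()
settings.set('ITEM_PIPELINES', {'myproject.pipelines.TrackPipeline': 300})
process = CrawlerProcess(settings)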

By the way, according to the docs for CrawlerProcess, in

tracks = process.crawl('pitchfork_tracks', domain='pitchfork.com')

the tracks you get back is a Twisted Deferred object, which is not iterable. Unless you are familiar with Twisted and Scrapy's internals, you had better follow the workflow of the Scrapy framework.
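That said, if you really do want the items back in the calling script, one common workaround (a sketch, assuming the same spider name and domain as your snippet) is to connect a handler to the item_scraped signal and collect the items into a plain list before process.start() returns:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

items = []

def collect_item(item, response, spider):
    # called once per scraped item, so the list fills up while the crawl runs
    items.append(item)

process = CrawlerProcess(get_project_settings())
crawler = process.create_crawler('pitchfork_tracks')
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler, domain='pitchfork.com')
process.start()  # blocks until the crawl is finished

for item in items:  # now this is an ordinary, iterable list
    print(item)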

Thanks.
