
Scrapy - processing items with pipeline

I'm running scrapy from a python script.

I was told that in scrapy, responses are built in parse() and further processed in pipeline.py.

This is how my framework is set up so far:

python script

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def script(self):
        process = CrawlerProcess(get_project_settings())
        # note: crawl() schedules the spider and returns a Deferred, not a response object
        response = process.crawl('pitchfork_albums', domain='pitchfork.com')
        process.start()  # the script will block here until the crawling is finished

spiders

    import scrapy

    from blogs.items import PitchforkItem  # path follows the 'blogs' project name used in settings.py


    class PitchforkAlbums(scrapy.Spider):
        name = "pitchfork_albums"
        allowed_domains = ["pitchfork.com"]
        # a request is generated for each URL listed here
        start_urls = [
            "http://pitchfork.com/reviews/best/albums/?page=1",
            "http://pitchfork.com/reviews/best/albums/?page=2",
            "http://pitchfork.com/reviews/best/albums/?page=3"
        ]

        def parse(self, response):
            for sel in response.xpath('//div[@class="album-artist"]'):
                item = PitchforkItem()
                item['artist'] = sel.xpath('//ul[@class="artist-list"]/li/text()').extract()
                item['album'] = sel.xpath('//h2[@class="title"]/text()').extract()

            yield item


items.py

    import scrapy


    class PitchforkItem(scrapy.Item):
        artist = scrapy.Field()
        album = scrapy.Field()


settings.py

    ITEM_PIPELINES = {
        'blogs.pipelines.PitchforkPipeline': 300,
    }


pipelines.py

    import json


    class PitchforkPipeline(object):

        def __init__(self):
            self.file = open('tracks.jl', 'wb')

        def process_item(self, item, spider):
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            for i in item:
                return i['album'][0]


If I just return item in pipelines.py, I get data like this (one item per html page):

{'album': [u'Sirens',
u'I Had a Dream That You Were Mine',
u'Sunergy',
u'Skeleton Tree',
u'My Woman',
u'JEFFERY',
u'Blonde / Endless',
u' A Mulher do Fim do Mundo (The Woman at the End of the World) ',
u'HEAVN',
u'Blank Face LP',
u'blackSUMMERS\u2019night',
u'Wildflower',
u'Freetown Sound',
u'Trans Day of Revenge',
u'Puberty 2',
u'Light Upon the Lake',
u'iiiDrops',
u'Teens of Denial',
u'Coloring Book',
u'A Moon Shaped Pool',
u'The Colour in Anything',
u'Paradise',
u'HOPELESSNESS',
u'Lemonade'],
'artist': [u'Nicolas Jaar',
u'Hamilton Leithauser',
u'Rostam',
u'Kaitlyn Aurelia Smith',
u'Suzanne Ciani',
u'Nick Cave & the Bad Seeds',
u'Angel Olsen',
u'Young Thug',
u'Frank Ocean',
u'Elza Soares',
u'Jamila Woods',
u'Schoolboy Q',
u'Maxwell',
u'The Avalanches',
u'Blood Orange',
u'G.L.O.S.S.',
u'Mitski',
u'Whitney',
u'Joey Purp',
u'Car Seat Headrest',
u'Chance the Rapper',
u'Radiohead',
u'James Blake',
u'White Lung',
u'ANOHNI',
u'Beyonc\xe9']}


What I would like to do in pipelines.py is to be able to fetch the individual albums for each item, like so:

    [u'Sirens']

Please help?

Answer

I suggest that you build well-structured items in the spider. In the Scrapy framework's workflow, the spider is used to build well-formed items (parse the html and populate item instances), and the pipeline is used to do operations on items (filter items, store items, and so on).

For your application, if I understand correctly, each item should be an entry describing one album. So when parsing the html, you should build that kind of item, instead of cramming everything into a single item.
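
For example, a single well-formed item would look roughly like this (the values are taken from the output in your question; artist kept as a list, album as a scalar, as discussed below):

    {'artist': [u'Nicolas Jaar'], 'album': u'Sirens'}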

So in the parse function of your spider.py, you should:

  1. Put the yield item statement inside the for loop, NOT outside it. This way, each album generates its own item.
  2. Be careful with relative XPath selectors in Scrapy. To select self-and-descendants relative to the current node, use .// instead of //, and to select direct children, use ./ instead of /.
  3. Ideally the album title should be a scalar and the album artist a list, so use extract_first() to make the album title a scalar.

    def parse(self, response):
        for sel in response.xpath('//div[@class="album-artist"]'):
            item = PitchforkItem()
            item['artist'] = sel.xpath('./ul[@class="artist-list"]/li/text()').extract()
            item['album'] = sel.xpath('./h2[@class="title"]/text()').extract_first()
            yield item
    

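Once the spider yields one item per album, process_item in your pipeline receives each album separately, so you can read item['album'] directly. Below is a minimal sketch of what pipelines.py could look like under that assumption, reusing the tracks.jl filename and fields from your question (open_spider/close_spider are the standard pipeline hooks for managing the file):

    import json


    class PitchforkPipeline(object):

        def open_spider(self, spider):
            self.file = open('tracks.jl', 'w')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # each call now sees a single album, e.g. item['album'] == u'Sirens'
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            return item  # return the item so any later pipelines can still use it
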
Hope this helps.
