Leonid Ivanov - 6 days ago
Python Question

Scrapy crawls but doesn't scrape

The problem is that if I add a product URL directly to start_urls, everything works just fine. But when a product page turns up during the crawl (all crawled pages return '200'), it doesn't get scraped....
I'm running spider through:

scrapy crawl armani_products -t csv -o Armani.csv


Spider code:

#-*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from armani.items import ArmaniItem
import datetime


class ArmaniProducts(CrawlSpider):
    name = 'armani_products'
    allowed_domains = ['www.armani.com']
    start_urls = [
        #'http://www.armani.com/us/giorgioarmani/sweater_cod39636734fs.html',
        #'http://www.armani.com/us/giorgioarmani/sweater_cod39693703uh.html',
        #'http://www.armani.com/us/giorgioarmani/pantaloni-5-tasche_cod36883777uu.html',
        #'http://www.armani.com/fr/giorgioarmani/robe_cod34663996xk.html',
        #'http://www.armani.com/fr/giorgioarmani/trousers_cod36898044mj.html',
        'http://www.armani.com/us/giorgioarmani/women/onlinestore/suits-and-jackets',
    ]

    rules = (
        # Follow links within the US and FR storefronts
        Rule(LinkExtractor(allow=('http://www.armani.com/us/giorgioarmani/', 'http://www.armani.com/fr/giorgioarmani/', )), follow=True),
        # Extract product links (URLs containing '_cod...') and parse them with parse_item
        Rule(LinkExtractor(allow=(r'.*_cod.*\.html', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = ArmaniItem()
        item['name'] = response.xpath('//h2[@class="productName"]/text()').extract()
        item['price'] = response.xpath('//span[@class="priceValue"]/text()')[0].extract()
        if response.xpath('//span[@class="currency"]/text()')[0].extract() == '$':
            item['currency'] = 'USD'
        else:
            item['currency'] = response.xpath('//span[@class="currency"]/text()')[0].extract()
        item['category'] = response.xpath('//li[@class="selected leaf"]/a/text()').extract()
        item['sku'] = response.xpath('//span[@class="MFC"]/text()').extract()
        # .extract() returns a list, never True, so test truthiness instead of comparing to True
        if response.xpath('//div[@class="soldOutButton"]/text()').extract() or response.xpath('//span[@class="outStock"]/text()').extract():
            item['avaliability'] = 'No'
        else:
            item['avaliability'] = 'Yes'
        item['time'] = datetime.datetime.now().strftime("%Y.%m.%d %H:%M")
        item['color'] = response.xpath('//*[contains(@id, "color_")]/a/text()').extract()
        item['size'] = response.xpath('//*[contains(@id, "sizew_")]/a/text()').extract()
        if '/us/' in response.url:
            item['region'] = 'US'
        elif '/fr/' in response.url:
            item['region'] = 'FR'
        item['description'] = response.xpath('//div[@class="descriptionContent"]/text()')[0].extract()
        return item


What am I missing?

Answer

I've tested this, and it seems the website blocks all non-standard User-Agents (by returning 403). So try setting the user_agent class attribute to something common, like:

class ArmaniProducts(CrawlSpider):
    name = 'armani_products'
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0'

or just set it in your project's settings.py:

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0'

You can find more user-agent strings around the web, e.g. in the official Mozilla documentation.
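
If you want to confirm the 403 yourself, a quick check (assuming the site behaves the same from the shell) is to fetch one of your category URLs in scrapy shell with and without a browser User-Agent, passing it via the -s setting override:

    # Default Scrapy User-Agent -- response.status should come back 403
    scrapy shell 'http://www.armani.com/us/giorgioarmani/women/onlinestore/suits-and-jackets'

    # Browser User-Agent passed as a setting -- response.status should come back 200
    scrapy shell -s USER_AGENT='Mozilla/5.0 (X11; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0' \
        'http://www.armani.com/us/giorgioarmani/women/onlinestore/suits-and-jackets'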

EDIT:
Upon further inspection I see that your LinkExtractor logic is faulty. Link extractors run in the order the rules are defined, and yours overlap: the first extractor (the one with follow=True) also matches product pages, so product URLs get requested through that rule first, without a callback. When the second rule's extractor finds the same URLs, its requests are dropped by the duplicate filter, so parse_item never runs.

You need to rework your first link extractor to avoid product pages. You can do that by copying the allow pattern of your product extractor into the deny parameter of the first extractor.
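
Something like this (a sketch reusing your existing patterns, untested against the live site):

    rules = (
        # Follow category/listing pages, but deny product pages so the second rule gets them
        Rule(LinkExtractor(
            allow=('http://www.armani.com/us/giorgioarmani/', 'http://www.armani.com/fr/giorgioarmani/'),
            deny=(r'.*_cod.*\.html',),
        ), follow=True),
        # Product pages now match only this rule and are routed to parse_item
        Rule(LinkExtractor(allow=(r'.*_cod.*\.html',)), callback='parse_item'),
    )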
