Python Question

Can't crawl more than a few items per page

I'm new to scrapy and have tried to crawl a couple of sites, but I wasn't able to get more than a few images from them.

For example, for http://shop.nordstrom.com/c/womens-dresses-new with the following code -

def parse(self, response):
    for dress in response.css('article.npr-product-module'):
        yield {
            'src': dress.css('img.product-photo').xpath('@src').extract_first(),
            'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
        }


I got 6 products. I expect 66.

For URL https://www.renttherunway.com/products/dress with the following code -

def parse(self, response):
    for dress in response.css('div.cycle-image-0'):
        yield {
            'image-url': dress.xpath('.//img/@src').extract_first(),
        }


I got 12. I expect roughly 100.

Even when I changed the spider to crawl every 'next' page, I got the same number of items per page, although it did go through all the pages successfully.
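Roughly, that pagination change looked like this (a sketch; the a.next selector is my guess at the next-link markup):

import scrapy

def parse(self, response):
    for dress in response.css('article.npr-product-module'):
        yield {
            'src': dress.css('img.product-photo').xpath('@src').extract_first(),
            'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
        }
    # follow the 'next' page link, if present
    next_page = response.css('a.next::attr(href)').extract_first()
    if next_page is not None:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)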

I have tried a different USER_AGENT, disabled cookies (COOKIES_ENABLED = False), and set DOWNLOAD_DELAY to 5.
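In settings.py, that amounted to roughly this (a sketch of the tweaks, with a placeholder user agent string):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # placeholder alternative UA
COOKIES_ENABLED = False  # disable cookies
DOWNLOAD_DELAY = 5       # seconds between requests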

I imagine I'll run into the same problem on almost any site, so others must have seen this before, but I can't find a reference to it.

What am I missing?

Answer

It's one of those websites that store the product data as JSON in the HTML source and unpack it with JavaScript on page load.
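To illustrate the pattern (a toy sketch, not the site's actual markup): the HTML hands a JSON blob to a JavaScript call, so scrapy only ever sees the string while the browser unpacks it on load.

import json
import re

# toy stand-in for the raw HTML scrapy receives
html = '<script>render(ProductResults, {"data": {"ProductResult": {"Products": []}}})</script>'

match = re.search(r"ProductResults, ({.+})\)", html)
print(json.loads(match.group(1)))
# {'data': {'ProductResult': {'Products': []}}}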

To figure this out, what you usually want to do is:

  • disable JavaScript and do scrapy view <url>
  • investigate the results
  • find the id in the product URL and search for that id in the page source to check whether it exists and, if so, where it is hidden. If it doesn't exist, it's being populated by an AJAX request -> re-enable JavaScript, load the page, and dig through the browser inspector's network tab to find it (a quick check is sketched below).
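For the third step, a quick check from a scrapy shell session (scrapy shell <url>; the id below is a hypothetical one lifted from a product URL):

product_id = '452168'  # hypothetical id from a product URL
print(product_id in response.body_as_unicode())
# True  -> the data is hidden somewhere in the source
# False -> it's coming from an AJAX request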

If you do a regex-based search:

re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())

you'll get a huge JSON blob that contains all of the products and their information.

import json
import re

# extract the embedded JSON and drill down to the product list
data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
data = json.loads(data[0])['data']
print(len(data['ProductResult']['Products']))
>> 66

That gets the correct number of products!
So in your parse you can do this:

def parse(self, response):
    # extract the embedded JSON as shown above
    data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
    data = json.loads(data[0])['data']
    for product in data['ProductResult']['Products']:
        # find the main image
        image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
        yield {'image_url': image_url}
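Note that image_url here is a list, since more than one media entry could be tagged MainImage; take image_url[0] if you only want a single URL.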