khpeek khpeek - 1 year ago 83
Python Question

Scrapy feed output contains the expected output several times instead of just once

I've written a spider whose sole purpose is to extract one number: the maximum page number from the pager at the bottom of the page (e.g., the number 255 in the example below).

[Image: screenshot of the pager]

I managed to do this using a LinkExtractor with a regular expression that the URLs of these pages match. The spider is shown below:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from Funda.items import MaxPageItem

class FundaMaxPagesSpider(CrawlSpider):
    name = "Funda_max_pages"
    allowed_domains = [""]
    start_urls = [""]

    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])  # Link to a page containing thumbnails of several houses, such as

    rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
    )

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0  # Initialize the maximum page number
        page_numbers = []  # Page numbers found in the pager links
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):  # Select only pages with a link depth of 3
                page_number = int(link.url.split("/")[-2].strip('p'))  # For example, get the number 10 out of the string ''
                page_numbers.append(page_number)
                # if page_number > max_page_number:
                #     max_page_number = page_number  # Update the maximum page number if the current value is larger than its previous value
        max_page_number = max(page_numbers)
        print("The maximum page number is %s" % max_page_number)
        yield {'max_page_number': max_page_number}

If I run this with feed output by entering
scrapy crawl Funda_max_pages -o funda_max_pages.json
at the command line, the resulting JSON file looks like this:

{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257}

I find it strange that the dict is output 7 times instead of just once. After all, the yield statement is outside of the for loop. Can anyone explain this behavior?

Answer Source
  1. Your spider goes to the first start_url.
  2. It uses the LinkExtractor to extract 7 URLs.
  3. It downloads each of those 7 URLs and calls get_max_page_number on every response.
  4. For each of those 7 responses, get_max_page_number yields a dictionary, so the feed contains 7 items.
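
The steps above can be sketched as a toy model in plain Python. This is not Scrapy's actual implementation, only an illustration of the control flow. The base URL, the page numbers in the pager, and the helper names extract_links and crawl are hypothetical, and it assumes every followed page shows the same pager.

```python
import re

# Hypothetical base URL; the real one is not shown in the question.
BASE = "http://example.com/koop/amsterdam/"
# Hypothetical pager: 7 links, the last one pointing to the highest page.
PAGER_URLS = [BASE + "p%d/" % n for n in (2, 3, 4, 5, 6, 7, 257)]


def extract_links(url):
    """Stand-in for LinkExtractor.extract_links; assumes every page has the same pager."""
    return [link for link in PAGER_URLS if re.search(r"p\d+", link)]


def get_max_page_number(url):
    """Same logic as the spider's callback: yields exactly one dict per call."""
    page_numbers = [
        int(link.split("/")[-2].strip("p"))
        for link in extract_links(url)
        if link.count("/") == 6 and link.endswith("/")
    ]
    yield {"max_page_number": max(page_numbers)}


def crawl(start_url):
    """Toy version of a Rule: follow every link on the start page and
    run the callback once per followed page."""
    items = []
    for followed_url in extract_links(start_url):
        items.extend(get_max_page_number(followed_url))
    return items


items = crawl(BASE)
print(len(items))  # 7: one dict per followed page, not one in total
print(items[0])    # {'max_page_number': 257}
```

The yield is indeed outside the for loop, but the callback itself is invoked once per followed link. If only a single item is wanted, one option is to compute the number in a callback that runs once for the start page, for example by overriding CrawlSpider.parse_start_url or by using a plain scrapy.Spider with a parse method, instead of a Rule callback.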