
Scrapy request callback count is less than request count

I want to get the poem analysis with Scrapy.

I get the analysis in the following steps:


  1. get the poem list URL
  2. get the poem detail URL
  3. get the poem analysis URL

But I found that the callback is called fewer times than requests are made.
In my demo the request count is 10, but the callback is only called 8 times.

Here is the log:

2016-10-26 16:15:54 [scrapy] INFO: Scrapy 1.2.0 started (bot: poem)
2016-10-26 16:15:54 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'poem.spiders', 'SPIDER_MODULES': ['poem.spiders'], 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'poem'}
2016-10-26 16:15:54 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-10-26 16:15:54 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-26 16:15:54 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-26 16:15:54 [scrapy] INFO: Enabled item pipelines:
['poem.pipelines.PoemPipeline']
2016-10-26 16:15:54 [scrapy] INFO: Spider opened
2016-10-26 16:15:54 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
poem list count : 10
callback count : 1
item count : 1
callback count : 2
item count : 2
callback count : 3
item count : 3
callback count : 4
item count : 4
callback count : 5
item count : 5
callback count : 6
item count : 6
callback count : 7
item count : 7
callback count : 8
item count : 8
2016-10-26 16:15:55 [scrapy] INFO: Closing spider (finished)
2016-10-26 16:15:55 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 5385,
'downloader/request_count': 20, // (10 * 2)
'downloader/request_method_count/GET': 20,
'downloader/response_bytes': 139702,
'downloader/response_count': 20,
'downloader/response_status_count/200': 20,
'dupefilter/filtered': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 26, 8, 15, 55, 416028),
'item_scraped_count': 8,
'log_count/INFO': 7,
'request_depth_max': 2,
'response_received_count': 20,
'scheduler/dequeued': 19,
'scheduler/dequeued/memory': 19,
'scheduler/enqueued': 19,
'scheduler/enqueued/memory': 19,
'start_time': datetime.datetime(2016, 10, 26, 8, 15, 54, 887101)}
2016-10-26 16:15:55 [scrapy] INFO: Spider closed (finished)


IDE: PyCharm

Debug: Terminal in PyCharm

Code:

############## spider.py ##############
import scrapy
from poem.items import PoemItem


class PoemSpider(scrapy.Spider):

    name = 'poem'

    analyze_count = 0

    start_urls = ['http://so.gushiwen.org/type.aspx']

    def parse(self, response):
        # 1. get poem list url
        poems = response.xpath("//div[@class='typeleft']/div[@class='sons']")
        for poem in poems:
            # 2. get poem detail url
            poem_href = poem.xpath("./p[1]/a/@href").extract_first()
            poem_url = response.urljoin(poem_href)
            yield scrapy.Request(poem_url, callback=self.parse_poem)

    def parse_poem(self, response):
        # 3. get analyze url
        analyze_href = response.xpath("//u[text()='%s']/parent::*/@href" % (u'赏析')).extract_first()
        analyze_url = response.urljoin(analyze_href)
        yield scrapy.Request(analyze_url, callback=self.parse_poem_analyze)

    def parse_poem_analyze(self, response):
        # print analyze callback called count
        print "#####################################"
        PoemSpider.analyze_count = PoemSpider.analyze_count + 1
        print PoemSpider.analyze_count
        poem = PoemItem()
        yield poem

############## pipelines.py ##############
class PoemPipeline(object):

    processcount = 0

    def process_item(self, item, spider):
        # print item count
        print ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>"
        PoemPipeline.processcount = PoemPipeline.processcount + 1
        print PoemPipeline.processcount
        return item

Answer

Your log is missing the DEBUG-level messages (LOG_LEVEL is set to INFO), but you can still see what happened by looking at the stats output:

{'downloader/request_bytes': 5385,
 'downloader/request_count': 20,  // (10 * 2)
 'downloader/request_method_count/GET': 20,  
 'downloader/response_bytes': 139702,
 'downloader/response_count': 20,
 'downloader/response_status_count/200': 20,
 'dupefilter/filtered': 2, <---------------------------------------------
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 10, 26, 8, 15, 55, 416028),
 'item_scraped_count': 8,  <---------------------------------------------
 'log_count/INFO': 7,
 'request_depth_max': 2,
 'response_received_count': 20,
 'scheduler/dequeued': 19,
 'scheduler/dequeued/memory': 19,
 'scheduler/enqueued': 19,
 'scheduler/enqueued/memory': 19,
 'start_time': datetime.datetime(2016, 10, 26, 8, 15, 54, 887101)}

So two of the requests you make are being filtered out by the dupefilter middleware, which makes sure that every request you schedule is unique (by default uniqueness is based on the request fingerprint, which for a plain GET is essentially the URL).
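For illustration, the default RFPDupeFilter compares request fingerprints, and two requests to the same URL get the same fingerprint. A minimal sketch (the poem URL here is made up, and it assumes Scrapy 1.x's scrapy.utils.request.request_fingerprint):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# Two requests to the same URL share a fingerprint, so the scheduler
# drops the second one before it ever reaches the downloader.
first = Request('http://so.gushiwen.org/view_123.aspx')
second = Request('http://so.gushiwen.org/view_123.aspx')
print(request_fingerprint(first) == request_fingerprint(second))  # True

# dont_filter=True would bypass the dupefilter for a single request,
# but in your case the duplicate is a symptom, not something to force through.
forced = Request('http://so.gushiwen.org/view_123.aspx', dont_filter=True)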

Your problem seems to be unsafe url creation here:

analyze_href = response.xpath("//u[text()='%s']/parent::*/@href"%(u'赏析')).extract_first()
analyze_url = response.urljoin(analyze_href)

Since analyze_href can be empty, you end up with analyze_url == response.url, and that request gets filtered out because you just crawled that very URL.
To avoid that, check whether analyze_href is empty and only build the URL when it isn't:

import logging  # at the top of spider.py

analyze_href = response.xpath("//u[text()='%s']/parent::*/@href" % (u'赏析')).extract_first()
if not analyze_href:
    logging.error("failed to find analyze_href for {}".format(response.url))
    return
analyze_url = response.urljoin(analyze_href)
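
If you also want the dropped requests to show up in the log next time, lowering the log level and enabling dupefilter debugging will print a "Filtered duplicate request" line for each filtered request. A sketch of the settings.py tweaks (both are standard Scrapy settings):

# settings.py
LOG_LEVEL = 'DEBUG'        # show DEBUG messages such as "Filtered duplicate request"
DUPEFILTER_DEBUG = True    # log every filtered duplicate, not only the first one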