Jeff Jeff - 6 months ago 82
Python Question

Scrapy Pull Same Data from Multiple Pages

This is related to the previous question I wrote here. I am trying to pull the same data from multiple pages on the same domain. A small explanation, I'm trying to pull data like offensive yards, turnovers, etc from a bunch of different box scores on a main page. Pulling the data from individual pages is working properly as is generation of the urls but when I try to have the spider cycle through all of the pages nothing is returned. I've looked through many other questions people have asked and the documentation and I can't figure out what is not working. Code is below. Thanks to anyone who's able to help in advance.

import scrapy

from scrapy import Selector
from nflscraper.items import NflscraperItem

class NFLScraperSpider(scrapy.Spider):
name = "pfr"
allowed_domains = ['']
start_urls = [

def parse(self,response):
for href in response.xpath('//a[contains(text(),"boxscore")]/@href'):
item = NflscraperItem()
url = response.urljoin(href.extract())
request = scrapy.Request(url, callback=self.parse_dir_contents)
request.meta['item'] = item
yield request

def parse_dir_contents(self,response):
item = response.meta['item']
# Code to pull out JS comment -
extracted_text = response.xpath('//div[@id="all_team_stats"]//comment()').extract()[0]
new_selector = Selector(text=extracted_text[4:-3].strip())
# Item population
item['home_score'] = response.xpath('//*[@id="content"]/table/tbody/tr[2]/td[last()]/text()').extract()[0].strip()
item['away_score'] = response.xpath('//*[@id="content"]/table/tbody/tr[1]/td[last()]/text()').extract()[0].strip()
item['home_oyds'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[6]/td[2]/text()').extract()[0].strip()
item['away_oyds'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[6]/td[1]/text()').extract()[0].strip()
item['home_dyds'] = item['away_oyds']
item['away_dyds'] = item['home_oyds']
item['home_turn'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[8]/td[2]/text()').extract()[0].strip()
item['away_turn'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[8]/td[1]/text()').extract()[0].strip()
yield item


The subsequent requests you make are filtered as offsite, fix your allowed_domains setting:

allowed_domains = [''] 

Worked for me.