Mace - 1 year ago
Python Question

Following hyperlink and "Filtered offsite request"

I know that there are several related threads out there, and they have helped me a lot, but I still can't get all the way. I am at the point where running the code doesn't result in errors, but I get nothing in my csv file. I have the following spider that starts on one web page, then follows a hyperlink, and scrapes the linked page:

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class bbrItem(Item):
    Year = Field()
    AppraisalDate = Field()
    PropertyValue = Field()
    LandValue = Field()
    Usage = Field()
    LandSize = Field()
    Address = Field()

class spiderBBRTest(BaseSpider):
    name = 'spiderBBRTest'
    allowed_domains = [""]
    start_urls = [',etage-a,side-a&gade=Septembervej&hus_nr=29&ipostnr=2730']

    def parse2(self, response):
        hxs = HtmlXPathSelector(response)
        bbrs2 = hxs.select("id('evaluationControl')/div[2]/div")
        bbrs = iter(bbrs2)
        for bbr in bbrs:
            item = bbrItem()
            item['Year'] = bbr.select("table/tbody/tr[1]/td[2]/text()").extract()
            item['AppraisalDate'] = bbr.select("table/tbody/tr[2]/td[2]/text()").extract()
            item['PropertyValue'] = bbr.select("table/tbody/tr[3]/td[2]/text()").extract()
            item['LandValue'] = bbr.select("table/tbody/tr[4]/td[2]/text()").extract()
            item['Usage'] = bbr.select("table/tbody/tr[5]/td[2]/text()").extract()
            item['LandSize'] = bbr.select("table/tbody/tr[6]/td[2]/text()").extract()
            item['Address'] = response.meta['address']
            yield item

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        PartUrl = ''.join(hxs.select("id('searchresult')/tr/td[1]/a/@href").extract())
        url2 = ''.join(["", PartUrl])
        yield Request(url=url2, meta={'address': hxs.select("id('searchresult')/tr/td[1]/a[@href]/text()").extract()}, callback=self.parse2)

I am trying to export the results to a csv file, but I get nothing in the file. Running the code, however, doesn't result in any errors. I know it's a simplified example with only one URL, but it illustrates my problem.

I think my problem could be that I am not telling Scrapy that I want to save the data in the csv file.

BTW, I run the spider as
scrapy crawl spiderBBR -o scraped_data.csv -t csv
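For context, Scrapy's CSV feed export writes one column per item field, much like Python's csv.DictWriter. A minimal stdlib-only sketch (not Scrapy itself; the field names are taken from bbrItem above and the row values are made-up placeholders) of what the exported file should look like once items are actually yielded:

```python
import csv
import io

# Field names mirror bbrItem; the row values are placeholders.
fields = ["Year", "AppraisalDate", "PropertyValue",
          "LandValue", "Usage", "LandSize", "Address"]
item = {"Year": "2012", "AppraisalDate": "01-10-2012",
        "PropertyValue": "1500000", "LandValue": "500000",
        "Usage": "Residential", "LandSize": "120",
        "Address": "Septembervej 29"}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()   # header row: one column per field
writer.writerow(item)  # one row per scraped item
print(buf.getvalue())
```

If the spider yields no items (for example because every request is filtered as offsite), the exporter has nothing to write and the file stays empty, which matches the symptom described above.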

Answer Source

You need to modify your yielded Request in parse to use parse2 as its callback.

EDIT: allowed_domains shouldn't include the http prefix, e.g.:

allowed_domains = [""]

Try that and see if your spider still runs correctly, instead of leaving allowed_domains blank.
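The reason the prefix matters: Scrapy's OffsiteMiddleware compares the hostname of each request against the entries in allowed_domains, so an entry that carries a scheme like "http://" can never match a bare hostname, and every followed request gets dropped as an offsite request. A rough sketch of that check (example.com is a placeholder domain, and the real middleware uses a compiled regex; this is just the idea):

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    """Rough approximation of Scrapy's offsite check: the request's
    hostname must equal an allowed domain or be a subdomain of it."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

# A bare domain matches the request's hostname...
print(is_allowed("http://example.com/page", ["example.com"]))
# ...but an entry that includes the scheme never matches any hostname.
print(is_allowed("http://example.com/page", ["http://example.com"]))
```

An empty allowed_domains list disables the filtering entirely, which is why leaving it blank can "fix" the symptom while hiding the misconfigured entry.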
