cstoddart - 29 days ago
Python Question

Scrapy With Splash Only Scrapes 1 Page

I am trying to scrape multiple URLs, but for some reason only results for one site show up. In every case it is the last URL in start_urls that appears.

I believe I have the problem narrowed down to my parse function.

Any ideas on what I'm doing wrong?

Thanks!

import scrapy
from scrapy_splash import SplashRequest


class HeatSpider(scrapy.Spider):
    name = "heat"

    start_urls = [
        'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
        'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 8},
                                )

    def parse(self, response):
        for metric in response.css('.matrix-data'):
            yield {
                'City': response.css('title::text').extract_first(),
                'Metric Data Title': metric.css('.title::text').extract_first(),
                'Metric Data Price': metric.css('.price::text').extract_first(),
            }


EDIT:

I have altered my code to help debug. After running this code, my CSV has a row for every URL, as it should, but only one row is filled out with information.

class HeatSpider(scrapy.Spider):
    name = "heat"

    start_urls = [
        'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
        'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 8},
                                )

    def parse(self, response):
        yield {
            'City': response.css('title::text').extract_first(),
            'Metric Data Title': response.css('.matrix-data .title::text').extract(),
            'Metric Data Price': response.css('.matrix-data .price::text').extract(),
            'url': response.url,
        }


EDIT 2:
Here is the full output: http://pastebin.com/cLM3T05P. On line 46 you can see the empty cells.

Answer

What worked for me was adding a delay between requests. Quoting the Scrapy documentation for DOWNLOAD_DELAY:

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

DOWNLOAD_DELAY = 5
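As a sketch of where this setting lives (assuming a standard Scrapy project layout), it goes in the project's settings.py:

```python
# settings.py (project-wide Scrapy settings)
# Wait 5 seconds between consecutive requests to the same domain,
# which also gives Splash time to finish rendering each page.
DOWNLOAD_DELAY = 5
```

Alternatively, it can be scoped to a single spider by setting `custom_settings = {"DOWNLOAD_DELAY": 5}` as a class attribute on the spider.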

I tested it on these four URLs and got results for all of them:

start_urls = [
    'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=washington&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=philadelphia&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
]