Jarratt Jarratt - 4 months ago 15
Javascript Question

Reload page if error "IndexError: list index out of range" occurs

I am scraping web pages, and sometimes the page does not load correctly and this error occurs:


IndexError: list index out of range


This happens because a page that has not loaded correctly is missing the element I index into, so the XPath query returns an empty list. Reloading the page solves this.

Is there a way to add error handling so that if the page is not loaded and the error occurs, the page is reloaded?

I have searched the internet and cannot find anything.

for link in links:

    #print('Fetching from link: ' + link)
    browser.get('http://www.racingpost.com' + link)
    time.sleep(5)
    print('http://www.racingpost.com' + link)
    tree = html.fromstring(browser.page_source)
    #print(browser.page_source)
    if count == 0:
        browser.find_element_by_xpath("//*[@id='re_']/div[2]/a[1]").click()
        browser.find_element_by_xpath("//*[@id='re_']/div[2]/a[2]").click()
        count = count + 1

    # first of all, pull all the data about the event itself, like going, distance, etc.
    title = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/h3/text()[2]')
    title = map(lambda x:x.strip(),title)
    title = [x.strip(' ') for x in title]
    details = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/ul/li[1]/text()[1]')
    prizemoney = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/ul/li[2]/text()[1]')
    setoff = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/h3/span/text()')
    course = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/h1/text()[1]')
    print(course)
    course[0] = course[0].replace('Result', '')
    date = tree.xpath('//*[@id="mainwrapper"]/div/div/div[2]/div[1]/div[2]/h1/text()[2]')
    timeoff = tree.xpath('//div[@class="raceInfo"]/text()[3]')


Above is a code snippet. If browser.get does not fetch the page (the server rejects the request or it times out), then I need to retry.

Answer

I think you need to do a little refactoring. It should look something like this:

def get_page(link):
    # all the code for fetching and parsing the page goes here
    # this code could either return an error code or raise an exception
    return data

for link in links:
    try:
        result = get_page(link)
        # here you need to store this result
    except IndexError:
        # log this error
        result = get_page(link)  # this is the retry; you can add time.sleep() here too

This is a quick and dirty example. Note that it retries only once, so a second failure will still raise IndexError. You can improve it with better retry logging, counting retries for each link globally, and so on.
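
One way to bound the number of retries is a small helper like the one below. This is only a sketch under assumptions: `fetch_with_retries`, `max_attempts` and `delay` are hypothetical names, not part of Selenium or lxml, and `fetch` stands for whatever function (such as `get_page`) loads and parses a single page and raises IndexError when the page is incomplete.

```python
import time


def fetch_with_retries(fetch, link, max_attempts=3, delay=5):
    """Call fetch(link), retrying on IndexError up to max_attempts times.

    `fetch` is a hypothetical callable that loads and parses one page.
    The last IndexError is re-raised if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')

    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(link)
        except IndexError:
            # The page probably did not load fully, so the XPath
            # lookup returned an empty list.
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            print('Attempt %d failed for %s, retrying' % (attempt, link))
            time.sleep(delay)  # give the server a moment before reloading
```

In the loop, `result = fetch_with_retries(get_page, link)` would then take the place of the try/except block. Catching only IndexError keeps unrelated bugs visible instead of silently retrying them.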