JSkjold - 3 months ago
Python Question

Trouble downloading xlsx file from website - Scraping

I'm trying to write some code that downloads the two latest publications of the Outage Weeks found at the bottom of http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/

They are xlsx files, which I'm going to load into Excel afterwards.
It doesn't matter which programming language the code is written in.

My first idea was to use the direct URLs, like http://www.eirgridgroup.com/site-files/library/EirGrid/Outage-Weeks_36(2016)-51(2016)_31%20August.xlsx
, and then write some code that guesses the URLs of the two latest publications.
But I have noticed some inconsistencies in the URL names, so that solution wouldn't work.

Instead, it might be a solution to scrape the website and use XPath to locate and download the files. I found out that the two latest publications always have the following XPaths:

/html/body/div[3]/div[3]/div/div/p[5]/a
/html/body/div[3]/div[3]/div/div/p[6]/a


This is where I need help. I'm new to both XPath and web scraping. I have tried something like this in Python:

from lxml import html
import requests

page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)

v = tree.xpath('/html/body/div[3]/div[3]/div/div/p[5]/a')


But v seems to be empty.

Any ideas would be greatly appreciated!

Answer

Just use contains to find the hrefs and slice the first two:

tree.xpath('//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")]/@href')[:2]

Or do it all in the XPath using [position() < 3]:

tree.xpath('(//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/@href')

The files are ordered from latest to oldest so getting the first two gives you the two newest.
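To see why the wrapping parentheses matter, here is a self-contained sketch against a hypothetical snippet that mimics the page's markup (the hrefs below are made up for illustration). Without the parentheses, [position() < 3] would apply per-<p> rather than to the combined result set:

```python
from lxml import html

# Hypothetical markup mimicking the outage-information page:
# several <p><a href=...></a></p> links, newest first.
snippet = """
<html><body><div><div>
  <p><a href="/site-files/library/EirGrid/Outage-Weeks_36(2016)-51(2016).xlsx">newest</a></p>
  <p><a href="/site-files/library/EirGrid/Outage-Weeks_35(2016)-50(2016).xlsx">second</a></p>
  <p><a href="/site-files/library/EirGrid/Outage-Weeks_34(2016)-49(2016).xlsx">third</a></p>
  <p><a href="/other/unrelated.pdf">ignore me</a></p>
</div></div></body></html>
"""

tree = html.fromstring(snippet)

# The parentheses make [position() < 3] select from the whole matched
# set of <a> elements, so only the first two matches survive.
hrefs = tree.xpath(
    '(//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")])'
    '[position() < 3]/@href'
)
print(hrefs)  # the two newest Outage-Weeks hrefs; unrelated.pdf is excluded
```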

To download the files you just need to join each href to the base url and write the content to a file:

from lxml import html
import requests
import os
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)

v = tree.xpath('(//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/@href')
for href in v:
    # os.path.basename(href) -> Outage-Weeks_35(2016)-50(2016).xlsx
    with open(os.path.basename(href), "wb") as f:
        f.write(requests.get(urljoin("http://www.eirgridgroup.com", href)).content)
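A quick note on the joining step: urljoin resolves a root-relative href against the site's base URL, so you don't have to concatenate strings by hand. A minimal sketch (the filename below is hypothetical, in the shape the page uses):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://www.eirgridgroup.com"
# Hypothetical root-relative href as scraped from the page
href = "/site-files/library/EirGrid/Outage-Weeks_35(2016)-50(2016).xlsx"

# urljoin keeps the scheme and host from base and takes the path from href
url = urljoin(base, href)
print(url)
# -> http://www.eirgridgroup.com/site-files/library/EirGrid/Outage-Weeks_35(2016)-50(2016).xlsx
```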