Cliff - 27 days ago
Python Question

How to extract a specific CSV from a web page's HTML containing multiple CSV file links

I need to extract a CSV file from an HTML page (see below), and once I get that I can do stuff with it. Below is code from a previous assignment that extracts that particular line of HTML code. The URL is 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'. That is test code, so for now it just breaks out of the loop when it finds that line. The part of the line with my CSV is the href, csv/datasets/co2.csv (its type is unicode, I think).

How do I open the co2.csv?

Sorry about any formatting issues with the question; the code has been sliced and diced by the editor.

import urllib
url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
from BeautifulSoup import *

def scrapper(url, k):
    c = 0
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
#. Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        y = tag.get('href', None)
        #print(y)
        if y == 'csv/datasets/co2.csv':
            print y
            break
        c = c + 1
        if c is k:
            return y
    print(type(y))

for w in range(29):
    print(scrapper(url, w))

Answer

You're re-downloading and re-parsing the full HTML page on every one of the 30 iterations of your loop, just to get the next CSV link and see whether it is the one you want. That is very inefficient, and not very polite to the server. Read the HTML page once, and use the loop over the tags you already had to check whether each tag is the one you want. If it is, do something with it, then stop looping to avoid needless further processing, since you said you only need that one particular file.

The other issue, the one your question is actually about, is that the CSV hrefs in the HTML file are relative URLs, so you have to join them to the base URL of the document they appear in. urlparse.urljoin() does just that.
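For instance, with the base URL from your question and the co2.csv href from the page:

import urlparse

base = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
href = 'csv/datasets/co2.csv'
print urlparse.urljoin(base, href)
# https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv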

Not directly related to the question, but you should also clean up your code:

  • fix your indentation (the comment on line 9 doesn't line up with the code around it)
  • choose better variable names; y, c, k and w are meaningless.

Resulting in something like:

import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'


def scraper(url):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href = tag.get('href', None)
        # guard against anchors without an href attribute
        if href and href.endswith("/co2.csv"):
            csv_url = urlparse.urljoin(url, href)
            # ... do something with the csv file....
            contents = urllib.urlopen(csv_url).read()
            print "csv file size =", len(contents)
            break   # we only needed this one file, so we end the loop


scraper(url)
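If "do something with the csv file" means reading the rows rather than the raw bytes, here is a minimal sketch using the standard csv module on the joined URL from the example above (this assumes the plain stdlib is enough for your needs; for real analysis you may prefer something like pandas):

import csv
import urllib
import StringIO

csv_url = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv'
contents = urllib.urlopen(csv_url).read()
reader = csv.reader(StringIO.StringIO(contents))
header = reader.next()  # first row holds the column names
for row in reader:
    print row           # each row is a list of strings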