I need to extract a csv file from the html page below, and once I have it I can do stuff with it. Below is code from a previous assignment that extracts the particular line of html I care about. The url is 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'.
This is test code, so it breaks temporarily when it finds that line.
The part of the line with my csv is href="csv/datasets/co2.csv" (a unicode string, I think).
How do I open the co2.csv?
Sorry about any formatting issues with the question; the code has been sliced and diced by the editor.
import urllib
from BeautifulSoup import *

url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    y = tag.get('href', None)
    if y == 'csv/datasets/co2.csv':
        c = c + 1
        if c is k:
            for w in range(29):
You're re-downloading and re-parsing the full html page on every one of the 30 iterations of your loop, just to get the next csv file and see whether it is the one you want. That is very inefficient, and not very polite to the server. Read the html page once, and use the loop over the tags you already had to check whether each tag is the one you want. If it is, do something with it, and stop looping to avoid needless further processing, since you said you only need that one file.
The other issue, related to your actual question, is that the csv hrefs in the html file are relative urls, so you have to join them onto the base url of the document they appear in. urlparse.urljoin() does just that.
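For instance, the join resolves the relative href against the directory of the page it came from. (Note this example uses Python 3, where the function lives in urllib.parse rather than urlparse; the behaviour is the same.)

```python
from urllib.parse import urljoin

base = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
# The relative href is resolved against the directory containing datasets.html
print(urljoin(base, 'csv/datasets/co2.csv'))
# -> https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv
```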
Not directly related to the question, but you should also try to clean up your code. The result is something like:
import urllib
import urlparse
from BeautifulSoup import *

url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'

def scraper(url):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href = tag.get('href', None)
        # guard against anchors without an href attribute
        if href and href.endswith("/co2.csv"):
            csv_url = urlparse.urljoin(url, href)
            # ... do something with the csv file ...
            contents = urllib.urlopen(csv_url).read()
            print "csv file size=", len(contents)
            break  # we only needed this one file, so we end the loop

scraper(url)
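The code above is Python 2 with the old BeautifulSoup 3 module, neither of which exists on Python 3. As a sketch of the same parse-once-then-break approach using only the Python 3 standard library: html.parser stands in for BeautifulSoup, and urllib.parse.urljoin resolves the relative href. The CsvLinkFinder class and the sample_html snippet are illustrative stand-ins, not taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class CsvLinkFinder(HTMLParser):
    """Collect hrefs of anchor tags whose href ends with the wanted suffix."""
    def __init__(self, suffix):
        super().__init__()
        self.suffix = suffix
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; anchors may lack href
        href = dict(attrs).get('href')
        if href and href.endswith(self.suffix):
            self.matches.append(href)

base_url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
# Stand-in for the anchor tags found on the real datasets page
sample_html = ('<a href="csv/datasets/airmiles.csv">airmiles</a>'
               '<a href="csv/datasets/co2.csv">co2</a>')

finder = CsvLinkFinder('/co2.csv')
finder.feed(sample_html)
csv_url = urljoin(base_url, finder.matches[0])
print(csv_url)
# -> https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv
```

From there you would fetch csv_url once (e.g. with urllib.request.urlopen) and feed the bytes to the csv module.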