
BeautifulSoup not downloading files as expected

I'm trying to download all the .txt files from this website with the following code:

from bs4 import BeautifulSoup as bs
import urllib
import urllib2

baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"

soup = bs(urllib2.urlopen(baseurl), 'lxml')
links = soup.findAll("a")
for link in links:
    print link.text
    urllib.urlretrieve(baseurl+link.text, link.text)


When I run this code, the print link.text line prints the correct file names, and the directory gets populated with files that have the correct names, but the contents of each file look something like this:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /props/volume-1/data/ ance_8.5x6_2849cm_4000.txt was not found on this server.</p>
<p>Additionally, a 404 Not Found
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache/2.2.29 (Unix) mod_ssl/2.2.29 OpenSSL/1.0.1e-fips mod_bwlimited/1.4 Server at m-selig.ae.illinois.edu Port 80</address>
</body></html>


Thus, I'm sure the communication with the server is working, but I'm evidently not building the file URLs or saving the file contents correctly.

Also, I'm currently downloading every linked file via findAll("a"), but I would actually like to download only specific files with names matching a pattern such as *geom.txt.

Answer
from bs4 import BeautifulSoup as bs
import urllib 
import urllib2

baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"

soup = bs(urllib2.urlopen(baseurl), 'lxml')
links = soup.findAll("a")
for link in links:
    # The anchor text on this index page has surrounding whitespace,
    # which produces a bad URL; strip it before use.
    fname = link.text.strip()
    print fname
    data = urllib.urlopen(baseurl + fname)
    with open(fname, "wb") as fs:
        fs.write(data.read())

Use the strip() function to remove the whitespace from the link text before building the URL, and the downloads will work.
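
For the second part of your question, downloading only files whose names match a pattern like *geom.txt, you can filter the stripped link text before fetching. Below is a minimal sketch using fnmatch from the standard library; the pattern value is an assumption taken from the example in your question:

from bs4 import BeautifulSoup as bs
import fnmatch
import urllib
import urllib2

baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"
pattern = "*geom.txt"  # assumed pattern, taken from the question text

soup = bs(urllib2.urlopen(baseurl), 'lxml')
for link in soup.findAll("a"):
    fname = link.text.strip()
    # fnmatch.fnmatch does shell-style wildcard matching, so
    # "*geom.txt" matches any name ending in "geom.txt".
    if fnmatch.fnmatch(fname, pattern):
        print fname
        data = urllib.urlopen(baseurl + fname)
        with open(fname, "wb") as fs:
            fs.write(data.read())

fnmatch gives shell-style wildcard matching; for this particular pattern, fname.endswith("geom.txt") would work just as well.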
