user2656931 - 1 month ago
Python Question

Copying URLs to file that contain specific term

I'm trying to collect every URL in the range whose page contains either "Recipe adapted from" or "Recipe from". The script writes matching links to the file until about entry 7496, then fails with HTTPError 404. What am I doing wrong? I've tried using BeautifulSoup and requests instead, but I still can't get it to work.

import urllib2
with open('recipes.txt', 'w+') as f:
    for i in range(14477):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        page_content = urllib2.urlopen(url).read()
        if "Recipe adapted from" in page_content:
            print url
            f.write(url + '\n')
        elif "Recipe from" in page_content:
            print url
            f.write(url + '\n')
        else:
            pass

Answer

Some of the URLs you are trying to scrape do not exist, so the server answers with 404 and urlopen raises HTTPError. Simply skip those pages by catching the exception and moving on to the next URL:

import urllib2
with open('recipes.txt', 'w+') as f:
    for i in range(14477):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        try:
            page_content = urllib2.urlopen(url).read()
        except urllib2.HTTPError as error:
            if 400 <= error.code < 500:
                continue  # client errors: not found, unauthorized, etc.
            raise   # other errors we want to know about
        if "Recipe adapted from" in page_content or "Recipe from" in page_content:
            print url
            f.write(url + '\n')
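
If you are on Python 3, urllib2 no longer exists; its pieces live in urllib.request and urllib.error. The same skip-on-4xx idea might look roughly like the sketch below. The names fetch_or_none and matching_urls and the opener parameter are hypothetical helpers introduced for illustration, and decoding the page as UTF-8 is an assumption about the site.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# The two phrases the question is searching for.
TERMS = ("Recipe adapted from", "Recipe from")


def fetch_or_none(url, opener=urlopen):
    """Return the page body as text, or None if the server answers 4xx.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; it defaults to urllib.request.urlopen.
    """
    try:
        response = opener(url)
        try:
            # Assumption: pages are UTF-8; undecodable bytes are replaced.
            return response.read().decode("utf-8", errors="replace")
        finally:
            response.close()
    except HTTPError as error:
        if 400 <= error.code < 500:
            return None  # not found, unauthorized, etc.
        raise  # other errors we want to know about


def matching_urls(urls, opener=urlopen):
    """Yield each URL whose page contains one of TERMS, skipping 4xx pages."""
    for url in urls:
        text = fetch_or_none(url, opener)
        if text is not None and any(term in text for term in TERMS):
            yield url
```

To reproduce the original loop, pass matching_urls a generator of the entry_detail URLs for range(14477) and write each yielded URL to recipes.txt. Keeping the fetch in its own function means the 4xx handling lives in one place and the matching logic stays free of try/except.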