
Scraping several pages with BeautifulSoup

I'd like to scrape through several pages of a website using Python and BeautifulSoup4. The pages differ by only a single number in their URL, so I could actually make a declaration like this:

theurl = "beginningofurl/" + str(counter) + "/endofurl.html"

The link I've been testing with is this: http://www.worldofquotes.com/topic/Art/1/index.html

And my Python script is this:

import urllib.request
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless of how many pages it consists of. '''
    pager = 1

    while pager < 11:
        theurl = "http://www.worldofquotes.com/topic/Nature/" + str(pager) + "/index.html"
        thepage = urllib.request.urlopen(theurl)
        soup = BeautifulSoup(thepage, "html.parser")

        # Each quote sits in a <blockquote>; the author name is in a <span> inside an <a>.
        for link in soup.findAll('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')

        pager += 1

category_crawler()


So the question is: how can I replace the hardcoded number in the while loop so that the script automatically recognizes when it has passed the last page and quits on its own?

Answer

The idea is to use an endless loop and break out of it once the "arrow right" element is no longer on the page, which would mean you are on the last page. Simple and quite logical:

import requests
from bs4 import BeautifulSoup


page = 1
url = "http://www.worldofquotes.com/topic/Nature/{page}/index.html"
with requests.Session() as session:
    while True:
        response = session.get(url.format(page=page))
        soup = BeautifulSoup(response.content, "html.parser")

        # TODO: parse the page and collect the results

        if soup.find(class_="icon-arrow-right") is None:
            break  # last page

        page += 1

Note that we maintain a single web-scraping session with requests.Session() here, so connections and cookies are reused across the page requests.
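
For completeness, here is a hedged sketch of how the parsing from the question could be plugged into that loop, collecting the quotes into a list instead of printing them (the blockquote/p/a/span selectors are taken from the original script and assume the page markup has not changed):

import requests
from bs4 import BeautifulSoup

url = "http://www.worldofquotes.com/topic/Nature/{page}/index.html"
quotes = []  # collected (text, author) pairs

with requests.Session() as session:
    page = 1
    while True:
        response = session.get(url.format(page=page))
        soup = BeautifulSoup(response.content, "html.parser")

        # Parse the page: each quote sits in a <blockquote>, its author in a <span> inside an <a>
        # (selectors taken from the question; they assume the current markup).
        for block in soup.findAll('blockquote'):
            text = block.find('p').text.strip()
            author = block.find('a').find('span').text
            quotes.append((text, author))

        # No "arrow right" icon on the page means this was the last page.
        if soup.find(class_="icon-arrow-right") is None:
            break

        page += 1

print(len(quotes), 'quotes collected')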
