Phillip Phillip - 29 days ago 7
Python Question

Python requests module gets the same results despite incrementing page number

The only thing that changes in the URL is the page number, which is incremented after each request.

Other than Selenium or related tools, I’m not sure what approach could be used to traverse the pages. My instinct is that there may be some header/query combination to get the data directly, but I don't know where to find it.

url = 'http://therunningbug.co.uk/events/find-races.aspx?EventName=&AddressRegion=&AddressCounty=&Date=&Surface=#Sort=Date&page='

page = 1

while True:

    pageData = BeautifulSoup(requests.get(url + str(page)).content)

    articles = pageData.find('div', {'class':"items-content"})

    for a in articles.find_all('article'):
        name = a.find('span', {'itemprop':"name"}).text
        d, t = a.find('time').get('datetime').split('T')

        timeData = t[:-3]

        dateData = d.split('-')
        date = (dateData[1] + '/' + dateData[2] + '/' + dateData[0][2:]).strip()
        description = a.find('p', {'itemprop':"description"}).text.strip()
        webLink = 'http://therunningbug.co.uk' + a.find('a', {'itemprop':"url"}).get('href')
        category = a.find('span', {'class':"surface"}).text
        location = a.find('span', {'class':"region"}).text + ', ' + a.find('span', {'class':"county"}).text

        print name, ' -- name'
        print date, ', ', timeData, ' -- date, time'
        print description, ' -- description'
        print webLink, ' -- website link'
        print category, ' -- category'
        print location, ' -- location\n'

    page += 1

Answer

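The problem is most likely the '#' in the URL. Everything after it is a fragment, which is not sent to the server, so the site never sees page= and returns the first page every time. Pass the page number as a real query parameter and let requests URL-encode it: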

url = 'http://therunningbug.co.uk/events/find-races.aspx'
payload = {'page': page}
pageData = BeautifulSoup(requests.get(url, params = payload).content)

Building the query string by hand also works here, because the page number contains no characters that need URL encoding:

url = 'http://therunningbug.co.uk/events/find-races.aspx'
pageData = BeautifulSoup(requests.get(url + '?page=' + str(page)).content)

See the requests quickstart for how params are URL-encoded: http://docs.python-requests.org/en/master/user/quickstart/
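One likely reason the URL from the question kept returning the same page can be checked without any network access. The sketch below splits that URL with the standard library and shows that page= sits in the fragment rather than the query string. It assumes Python 3 (in Python 2 the same functions live in urlparse and urllib), and the page number 3 is only an illustrative value.

```python
from urllib.parse import urlsplit, urlencode

# URL from the question, with a page number appended the way the loop did it.
original = ('http://therunningbug.co.uk/events/find-races.aspx'
            '?EventName=&AddressRegion=&AddressCounty=&Date=&Surface='
            '#Sort=Date&page=3')

parts = urlsplit(original)
query = parts.query        # the part that is sent to the server
fragment = parts.fragment  # the part after '#', kept on the client side

# Rebuild the URL with page as a real query parameter.
base = 'http://therunningbug.co.uk/events/find-races.aspx'
fixed = base + '?' + urlencode({'page': 3})
```

Here query holds only the empty search filters, while page=3 ends up in fragment. The rebuilt URL in fixed is the same shape that requests produces from params.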

Complete Code:

#!/usr/bin/env python

import requests
from bs4 import BeautifulSoup

page = 1
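while True:

    url = 'http://therunningbug.co.uk/events/find-races.aspx'
    payload = {'page': page}
    pageData = BeautifulSoup(requests.get(url, params = payload).content)

    articles = pageData.find('div', {'class':"items-content"})

    # Assumption: a page past the last one has no results container or no
    # articles. Without this check the loop never ends (or crashes on None).
    if articles is None or not articles.find_all('article'):
        break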

    for a in articles.find_all('article'):
        name = a.find('span', {'itemprop':"name"}).text
        d, t = a.find('time').get('datetime').split('T')

        timeData = t[:-3]

        dateData = d.split('-')
        date = (dateData[1] + '/' + dateData[2] + '/' + dateData[0][2:]).strip()
        description = a.find('p', {'itemprop':"description"}).text.strip()
        webLink = 'http://therunningbug.co.uk' + a.find('a', {'itemprop':"url"}).get('href')
        category = a.find('span', {'class':"surface"}).text
        location = a.find('span', {'class':"region"}).text + ', ' + a.find('span', {'class':"county"}).text

        print name, ' -- name'
        print date, ', ', timeData, ' -- date, time'
        print description, ' -- description'
        print webLink, ' -- website link'
        print category, ' -- category'
        print location, ' -- location\n'

    page += 1
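The date and time handling in the loop could also be done with the datetime module instead of slicing strings by hand. This is a minimal sketch, assuming the datetime attribute always has the form YYYY-MM-DDTHH:MM:SS (which the t[:-3] slice above implies). The helper name format_event_datetime and the sample value are hypothetical and not taken from the site.

```python
from datetime import datetime


def format_event_datetime(value):
    """Return (MM/DD/YY, HH:MM) for a 'YYYY-MM-DDTHH:MM:SS' string."""
    # Assumed input format; strptime raises ValueError if it does not match.
    parsed = datetime.strptime(value, '%Y-%m-%dT%H:%M:%S')
    return parsed.strftime('%m/%d/%y'), parsed.strftime('%H:%M')


# Hypothetical sample value, in the same shape the scraper splits on 'T'.
date, timeData = format_event_datetime('2016-10-09T09:30:00')
```

Inside the loop this would replace the split('T') and split('-') lines with a single call on a.find('time').get('datetime'). A malformed attribute then raises a ValueError instead of producing a silently wrong date.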