athira athira - 9 months ago
HTML Question

How can I parse long web pages with Beautiful Soup?

I have been using the following code to parse the web page in the link. The code is expected to return the links of all the members of the given page.

from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('').read()
soup = BeautifulSoup(r,'lxml')
headers = soup.find_all('h3')
for header in headers:
    a = header.find('a')

But I get only the first 10 links from the above page. Even when printing with the prettify option, I see only the first 10 links. Can anyone help me resolve the issue?


The results are dynamically loaded by making AJAX requests to the endpoint.

Simulate them in your code with requests, maintaining a web-scraping session:

from bs4 import BeautifulSoup
import requests

url = ""
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}

    page = 0
    members = []
    while True:
        # get page
        # POST to the AJAX endpoint, requesting the given page of results
        response =, data={
            "p": str(page),
            "id": "#scrollbox1"
        })
        html = response.json()['html']

        # parse html
        soup = BeautifulSoup(html, "html.parser")
        page_members = [member.get_text() for member in".memberentry h3 a")]
        members += page_members  # accumulate member names across pages
        print(page, page_members)

        page += 1

It prints the current page number and the list of members on each page, accumulating the member names into a members list. I'm not posting what it prints since it contains real names.

Note that I've intentionally left the loop endless; please figure out the exit condition yourself. It might be, for example, when response.json() throws an error.