champion Ch - 1 year ago
Python Question

Scraping 'N' pages with BeautifulSoup and Requests (How to obtain the true page number)

I want to get all the titles from the website.

Right now, my code successfully scrapes only one page. However, the site has multiple pages, and I would like to scrape all of them.

For example, when I click the link to "page 2", the URL does NOT change. I looked at the page source and saw that the paging links call JavaScript to advance to the next page, like this: javascript:gotopage(2) or javascript:void(0).

Here is my code (it gets page 1 only):

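from bs4 import BeautifulSoup
import requests

url = ''  # URL of the first results page (left blank in the post)
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
# Select every link that is a direct child of a <td class="tit3"> cell
titles = soup.select('td.tit3 > a')
for title in titles:
    # The loop body is missing from the post; printing the link text is assumed
    print(title.get_text(strip=True))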

How can my code be changed to scrape titles from all the available listed pages?
Thank you very much!

Answer Source

Try to use the following URL format:

The site uses JavaScript to submit hidden page information to the server when it requests the next page. When you view the source you will find:

<form action="/zwhd/web/webindex.action" id="searchForm" name="searchForm" method="post">
 <div class="item">
     <div class="titlel">
     <label class="dow"></label>
     <input type="text" name="keyWord" id="keyword" value="" class="text"/>
     <div class="key">
            <li><span><input type="radio" checked="checked" value="3" name="searchType"/></span><p>编号</p></li>
            <li><span><input type="radio" value="2" name="searchType"/></span><p>关键字</p></li>
     <input type="button" class="btn1" onclick="search();" value="查询"/>
  <input type="hidden" id="pageIndex" name="page.currentpage" value="2"/>
  <input type="hidden" id="pageSize" name="page.pagesize" value="15"/>
  <input type="hidden" id="pageCount" name="page.pagecount" value="2357"/>
  <input type="hidden" id="docStatus" name="docStatus" value=""/>
  <input type="hidden" id="sendorg" name="sendOrg" value=""/>
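
Going by the hidden fields in that form, one way to page through the results might be to send the same fields yourself as a POST request and change page.currentpage each time. The following is only a sketch under several assumptions: the host in FORM_URL is a placeholder (the excerpt shows only the form's action path), the helper names build_payload, extract_titles and scrape_titles are made up for illustration, and it is assumed, not confirmed, that the server accepts these fields and returns the same td.tit3 markup for every page.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder host (assumption): only the action path comes from the form above.
FORM_URL = "http://example.invalid/zwhd/web/webindex.action"


def build_payload(page, page_size=15):
    """Build the POST body for one results page.

    Field names and default values are copied from the form's inputs;
    whether the server needs all of them is an assumption.
    """
    return {
        "keyWord": "",
        "searchType": "3",  # the radio button marked checked="checked"
        "page.currentpage": str(page),
        "page.pagesize": str(page_size),
        "docStatus": "",
        "sendOrg": "",
    }


def extract_titles(html):
    """Return the text of every link matched by 'td.tit3 > a'."""
    # 'html.parser' ships with Python; 'lxml', as used in the question,
    # should also work if it is installed.
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("td.tit3 > a")]


def scrape_titles(last_page, url=FORM_URL):
    """POST the form once per page (1..last_page) and collect all titles."""
    titles = []
    with requests.Session() as session:
        for page in range(1, last_page + 1):
            response = session.post(url, data=build_payload(page), timeout=30)
            response.raise_for_status()
            titles.extend(extract_titles(response.content))
    return titles
```

With the real host filled in, a call such as scrape_titles(last_page=3) would fetch the first three pages. The last page number is left to the caller here; the form's page.pagecount field may hold it, but that is a guess from the field name.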