Kainesplain - 2 months ago
Python Question

Python Web-scraping Solution

I'm new to Python and, as an exercise, I'm trying to scrape the page numbers from the search results at this url, which lists various published papers.

When I inspect the element I want to scrape, I find this HTML:

<div class="src">
Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
</div>


The part I want to extract is the text between the opening and closing tags of the div with class "src".
This is what I wrote in an attempt to do the job:

import requests
from bs4 import BeautifulSoup

url = "http://www.jstor.org/action/doAdvancedSearch?c4=AND&c5=AND&q2=&pt=&q1=nuclear&f3=all&f1=all&c3=AND&c6=AND&q6=&f4=all&q4=&f0=all&c2=AND&q3=&acc=off&c1=AND&isbn=&q0=china+&f6=all&la=&f2=all&ed=2001&q5=&f5=all&group=none&sd=2000"
r = requests.get(url)
soup = BeautifulSoup(r.content)
links = soup.find_all("div class='src'")
for link in links:
    print


I know that this code is unfinished and that's because I don't know where to go from here :/. Can anyone help me here?

Answer

If I understand you correctly, you want the page ranges from every div with class="src".

If so, then you need to do:

import requests
import re
from bs4 import BeautifulSoup

url = "http://www.jstor.org/action/doAdvancedSearch?c4=AND&c5=AND&q2=&pt=&q1=nuclear&f3=all&f1=all&c3=AND&c6=AND&q6=&f4=all&q4=&f0=all&c2=AND&q3=&acc=off&c1=AND&isbn=&q0=china+&f6=all&la=&f2=all&ed=2001&q5=&f5=all&group=none&sd=2000"
r = requests.get(url)
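soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly
# Tag name and attributes are separate arguments to find_all
links = soup.find_all('div', {'class': 'src'})
for link in links:
    # Raw string, with the dot escaped so it matches a literal "pp."
    pages = re.findall(r'(pp\.\s*\d+-\d+)', link.text)
    if pages:  # skip divs that contain no page range
        print(pages[0])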

Note that I have used a regex to get the page numbers. This may look strange to people unfamiliar with regular expressions, but I think it's more elegant than using string operations like strip and split.
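
To make the regex versus strip/split comparison concrete, here is a minimal sketch that runs both approaches on the citation text from the question, without any network request. The function names pages_with_regex and pages_with_split are made up for illustration, and both assume the page range always appears as "pp. N-M" at the end of the text.

```python
import re

# Citation text copied from the question's sample div.
citation = "Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63"


def pages_with_regex(text):
    """Return (first, last) page as ints, or None if no 'pp. N-M' is found."""
    match = re.search(r"pp\.\s*(\d+)-(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pages_with_split(text):
    """Same result using only string operations; assumes exactly this layout."""
    if "pp." not in text:
        return None
    tail = text.split("pp.")[-1].strip()  # "53-63"
    first, last = tail.split("-")
    return int(first), int(last)


print(pages_with_regex(citation))  # (53, 63)
print(pages_with_split(citation))  # (53, 63)
```

The split version works on this sample, but it would raise an error if anything followed the page range or if the range contained a second hyphen, whereas the regex only picks out the part it describes.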