Omi Slash - 3 months ago
Python Question

Python->Beautifulsoup->Webscraping->Looping over URL (1 to 53) and saving Results

Here is the Website I am trying to scrape http://livingwage.mit.edu/

The specific URLs are from

http://livingwage.mit.edu/states/01

http://livingwage.mit.edu/states/02

http://livingwage.mit.edu/states/04 (For some reason they skipped 03)

...all the way to...

http://livingwage.mit.edu/states/56


And on each one of these URLs, I need the last row of the second table:


Example for http://livingwage.mit.edu/states/01

Required annual income before taxes $20,260 $42,786 $51,642
$64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691
$56,934 $66,997


Desired output:

Alabama $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997

Alaska $24,070 $49,295 $60,933 $79,871 $38,561 $47,136 $52,233 $61,531 $38,561 $54,433 $66,316 $82,403

...

...

Wyoming $20,867 $42,689 $52,007 $65,892 $34,988 $41,887 $46,983 $53,549 $34,988 $47,826 $57,391 $68,424

After 2 hours of messing around, this is what I have so far (I am a beginner):

import requests, bs4

res = requests.get('http://livingwage.mit.edu/states/01')

res.raise_for_status()
states = bs4.BeautifulSoup(res.text)


state_name=states.select('h1')

table = states.find_all('table')[1]
rows = table.find_all('tr', 'odd')[4:]


result=[]

result.append(state_name)
result.append(rows)


When I view state_name and rows in the Python console, it gives me the HTML elements:

[<h1>Living Wag...Alabama</h1>]


and

[<tr class = "odd... </td> </tr>]


Problem 1: These are the things that I want in the desired output, but how can I get Python to give them to me as strings rather than as HTML like above?

Problem 2: How do I loop requests.get over the URLs from /states/01 to /states/56?

Thank you for your help.

And if you can offer a more efficient way of getting to the rows variable in my code, I would greatly appreciate it, because the way I get there is not very Pythonic.

Answer

Get all the state links from the initial page. Then, on each state page, select the second table and use the CSS classes odd and results to get the tr you need. There is no need to slice, because that combination of class names is unique:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


base = "http://livingwage.mit.edu"
res = requests.get(base)

res.raise_for_status()
states = []
# Get each state's URL and name from the anchor tags on the base page:
# select the anchors inside each li of the ul that has the CSS classes
# "states" and "list-unstyled".
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations". We want everything before
    # "/locations", so we split once on "/" from the right -> "/states/51"
    # and join that to the base URL. The anchor text holds the state name,
    # so we store the full URL and the name, e.g.
    # ("http://livingwage.mit.edu/states/01", "Alabama").
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))


def parse(soup):
    # Get the second table. CSS indexing starts at 1, so "table:nth-of-type(2)"
    # matches a table that is the second table among its siblings.
    table = soup.select_one("table:nth-of-type(2)")
    # The row we want is the tr with the CSS classes "odd" and "results".
    # "td + td" matches every td that directly follows another td, which skips
    # the first cell (the "Required annual income before taxes" label).
    # Calling .text on each remaining td gives its text as a plain string.
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]


# Unpack the url and state from each tuple in our states list. 
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))

If you run the code you will see output like:

Alabama ['$21,144', '$43,213', '$53,468', '$67,788', '$34,783', '$41,847', '$46,876', '$52,531', '$34,783', '$48,108', '$58,748', '$70,014']
Alaska ['$24,070', '$49,295', '$60,933', '$79,871', '$38,561', '$47,136', '$52,233', '$61,531', '$38,561', '$54,433', '$66,316', '$82,403']
Arizona ['$21,587', '$47,153', '$59,462', '$78,112', '$36,332', '$44,913', '$50,200', '$58,615', '$36,332', '$52,483', '$65,047', '$80,739']
Arkansas ['$19,765', '$41,000', '$50,887', '$65,091', '$33,351', '$40,337', '$45,445', '$51,377', '$33,351', '$45,976', '$56,257', '$67,354']
California ['$26,249', '$55,810', '$64,262', '$81,451', '$42,433', '$52,529', '$57,986', '$68,826', '$42,433', '$61,328', '$70,088', '$84,192']
Colorado ['$23,573', '$51,936', '$61,989', '$79,343', '$38,805', '$47,627', '$52,932', '$62,313', '$38,805', '$57,283', '$67,593', '$81,978']
Connecticut ['$25,215', '$54,932', '$64,882', '$80,020', '$39,636', '$48,787', '$53,857', '$61,074', '$39,636', '$60,074', '$70,267', '$82,606']
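If you would rather save the rows than print them, one option is the standard csv module. The following is only a sketch: the helper name write_rows and the in-memory buffer are made up for illustration, and the two sample rows are cut down from the output above. The csv writer quotes each amount because the values contain commas.

```python
import csv
import io


def write_rows(rows, out):
    """Write (state, values) pairs as CSV rows: state first, then the amounts."""
    writer = csv.writer(out)
    for state, values in rows:
        writer.writerow([state, *values])


# Two rows shortened from the sample output above, purely for illustration.
sample = [
    ("Alabama", ["$21,144", "$43,213"]),
    ("Alaska", ["$24,070", "$49,295"]),
]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_rows(sample, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, you could pass a file opened with open("living_wage.csv", "w", newline="") (a hypothetical filename), and build the rows in the final loop with (state, parse(soup)) instead of printing them.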

You could loop over the numeric IDs instead (01 to 56, with gaps such as 03), but extracting the anchors from the base page also gives us the state name in a single step. Using the h1 from each state page would give you Living Wage Calculation for Alabama, which you would then have to parse to get just the name. That is an extra step that needs some care, since some states have names of more than one word.
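For comparison, here is a rough sketch of that numeric-ID alternative. The helper names state_urls and state_from_heading are hypothetical, and the code assumes the h1 text always begins with exactly "Living Wage Calculation for ", which is the kind of assumption the anchor-based approach avoids. It only builds the URLs and cleans the heading text. The requests themselves would still need to skip IDs that do not exist (such as 03), for example by checking the response status before parsing.

```python
BASE = "http://livingwage.mit.edu"
# Assumed fixed heading prefix, taken from the h1 text described above.
PREFIX = "Living Wage Calculation for "


def state_urls(first=1, last=56):
    """Yield candidate state URLs with zero-padded two-digit IDs."""
    for n in range(first, last + 1):
        # {n:02d} pads single digits with a leading zero: 1 -> "01".
        yield f"{BASE}/states/{n:02d}"


def state_from_heading(heading):
    """Strip the assumed prefix from an h1 text, keeping multi-word names whole."""
    # Collapse any newlines or repeated spaces inside the heading first.
    heading = " ".join(heading.split())
    if heading.startswith(PREFIX):
        return heading[len(PREFIX):]
    # Unknown format: return the cleaned heading unchanged.
    return heading
```

With BeautifulSoup, the heading text could come from something like soup.select_one("h1").text before being passed to state_from_heading.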
