User Error - 7 months ago
Python Question

Python web scrape failing

I'm hoping someone can help me with a web scrape. It ran daily last year without problems; I turned it off for the winter, and now something on the page has changed and it no longer works.
I need to extract the Danger Rating code for each station listed. Last year, having BeautifulSoup look for the "tr" tag worked perfectly. I'm stumped.

Here's the site for a sample region: http://bcwildfire.ca/hprScripts/DgrCls/index.asp?Region=4

Here's my code up to where BS does its thing:

from urllib import urlopen
from HTMLParser import HTMLParser
import string, datetime, sys
from bs4 import BeautifulSoup

# Fire Danger ratings by station start at index 4

class HTMLCleaner(HTMLParser):
    container = ""
    def handle_data(self, data):
        self.container = self.container + "," + data
        return self.container

todayChk = datetime.date.today().strftime("%d-%b-%Y")

##FireRegions = {'Prince George': '4', 'Northwest': '3', 'Cariboo': '7', 'Kamloops': '5', 'Southeast': '6'}
FireRegions = {'Prince George': '4'}

Regs = FireRegions.keys()
Reg = 0

while Reg < len(FireRegions):
    print Regs[Reg] + " Region"
    content = urlopen('http://bcwildfire.ca/hprScripts/DgrCls/index.asp?Region='+FireRegions[Regs[Reg]]).read()
    soup = BeautifulSoup(content, 'html.parser')
    PGStats = soup.body.find_all("tr")
    print PGStats
    Reg+=1


Thanks so much if you can offer a solution.

Answer

It looks like the problem is caused by extra table and tr elements on the page, so soup.body.find_all("tr") now also matches rows outside the table you want. You need to narrow the search down to the specific table that holds the stations and ratings.
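To illustrate the symptom, here is a minimal sketch written for Python 3. The HTML is made up for the example (the real page's markup, the "Navigation" cell and the "Rating" column header are all assumptions); it only shows that a page-wide search for "tr" returns rows from every table, not just the one with the stations.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one layout table plus the data table we actually want.
html = """
<html><body>
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><td>[Dgr Rgn] Station</td><td>Rating</td></tr>
  <tr><td>[1] BEAR LAKE</td><td>3</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# A page-wide search collects the rows of both tables.
all_rows = soup.body.find_all("tr")
for row in all_rows:
    print(row.get_text(" ", strip=True))
```

The layout row comes back alongside the station rows, so any code that relies on row positions shifts as soon as the page gains another table.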

One option, since there are no id or class attributes we can use to distinguish the desired table from the others, is to find the table's header cell by its text and then go up to its parent table element:

table = soup.find(text="[Dgr Rgn] Station").find_parent("table")
for row in table.find_all("tr")[1:]:
    cells = row.find_all("td")
    print(cells[0].get_text(), cells[1].get_text())

Prints:

(u'[1] BEAR LAKE', u'3')
(u'[1] BEDNESTI', u'3')
(u'[1] CHETWYND (EC)', u'4')
...
(u'[1] VALEMOUNT HUB', u'4')
(u'[1] VANDERHOOF HUB', u'4')
(u'[1] WONOWON', u'4')
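
If you want to try the same idea without hitting the live site, here is a self-contained sketch for Python 3. The sample HTML and the station_ratings helper are hypothetical, made up for illustration; only the header text "[Dgr Rgn] Station" comes from the page. It uses the string= argument, which newer BeautifulSoup releases (4.4.0 and later) prefer over text=, and it returns an empty dict rather than raising if the header cannot be found.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<html><body>
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><td>[Dgr Rgn] Station</td><td>Rating</td></tr>
  <tr><td>[1] BEAR LAKE</td><td>3</td></tr>
  <tr><td>[1] BEDNESTI</td><td>3</td></tr>
</table>
</body></html>
"""

def station_ratings(markup, header_text="[Dgr Rgn] Station"):
    """Return {station: rating} from the table containing header_text."""
    soup = BeautifulSoup(markup, "html.parser")
    # Locate the header cell by its exact text.
    header = soup.find(string=header_text)
    if header is None:
        return {}
    # Walk up to the enclosing table.
    table = header.find_parent("table")
    if table is None:
        return {}
    ratings = {}
    # Skip the first row, which holds the column headers.
    for row in table.find_all("tr")[1:]:
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # ignore spacer or malformed rows
        ratings[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
    return ratings

ratings = station_ratings(html)
print(ratings)
```

Matching on the header text is brittle if the site rewords that cell, so the empty-dict fallback at least makes that failure easy to detect in a daily job.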