David Owens David Owens - 1 month ago 8
HTML Question

Using BeautifulSoup to scrape li's and id's in same method

How would i modify the parameters of the findAll method to read both li's and id's? li's are elements and id's are attributes correct?

#Author: David Owens
#File name: soupScraper.py
#Description: html scraper that takes surf reports from various websites

import csv
import requests
from bs4 import BeautifulSoup

###################### SURFLINE URL STRINGS AND TAG ###########################

slRootUrl = 'http://www.surfline.com/surf-report/'
slSunsetCliffs = 'sunset-cliffs-southern-california_4254/'
slScrippsUrl = 'scripps-southern-california_4246/'
slBlacksUrl = 'blacks-southern-california_4245/'
slCardiffUrl = 'cardiff-southern-california_4786/'

slTagText = 'observed-wave-range'
slTag = 'id'

#list of surfline URL endings
slUrls = [slSunsetCliffs, slScrippsUrl, slBlacksUrl, slCardiffUrl]

###############################################################################


#################### MAGICSEAWEED URL STRINGS AND TAG #########################

msRootUrl = 'http://magicseaweed.com/'
msSunsetCliffs = 'Sunset-Cliffs-Surf-Report/4211/'
msScrippsUrl = 'Scripps-Pier-La-Jolla-Surf-Report/296/'
msBlacksUrl = 'Torrey-Pines-Blacks-Beach-Surf-Report/295/'

msTagText = 'rating-text text-dark'
msTag = 'li'

#list of magicseaweed URL endings
msUrls = [msSunsetCliffs, msScrippsUrl, msBlacksUrl]

###############################################################################

'''
This method iterates through a list of urls and extracts the surf report from
the webpage dependent upon its tag location

rootUrl: The root url of each surf website
urlList: A list of specific urls to be appended to the root url for each
break
tag: the html tag where the actual report lives on the page

returns: a list of strings of each breaks surf report
'''
def extract_Reports(rootUrl, urlList, tag, tagText):
#empty list to hold reports
reports = []
#loop thru URLs
for url in urlList:
try:
#request page
request = requests.get(rootUrl + url)

#turn into soup
soup = BeautifulSoup(request.content, 'lxml')

#get the tag where report lives
reportTag = soup.findAll(id = tagText)

for report in reportTag:
reports.append(report.string.strip())

#notify if fail
except:
print 'scrape failure'
pass

return reports
#END METHOD

slReports = extract_Reports(slRootUrl, slUrls, slTag, slTagText)
msReports = extract_Reports(msRootUrl, msUrls, msTag, msTagText)

print slReports
print msReports


As of right now, only slReports prints correctly because i have it explicitly set to id = tagText. I am also aware that my tag paramater is not used currently.

Answer

So the problem is that you want to search the parse tree for elements that have either a class name of rating-text (it turns out you do not need text-dark to identify the relevant elements in the case of Magicseaweed) or an ID of observed-wave-range, using a single findAll call.

You can use a filter function to achieve this:

def reportTagFilter(tag):
    return (tag.has_attr('class') and 'rating-text' in tag['class']) \
        or (tag.has_attr('id') and tag['id'] == 'observed-wave-range')

Then change your extract_Reports function to read:

        reportTag = soup.findAll(reportTagFilter)[0]
        reports.append(reportTag.text.strip())
Comments