bentsh bentsh - 7 months ago 48
Python Question

Find elements by text with Beautifulsoup

I'm just learning Python and I've spent hours trying to figure this one thing out. Basically, I have an HTML document with a repetitive structure, and I am trying to pull certain elements out of each repetition. I figured out how to pull out the first element, but I cannot for the life of me figure out how to pull out any of the others. The first one was easy because it has a distinct class, but the rest don't. Please help before I go insane.

The following is the repetitive section of HTML. I want to pull out the first header, which I was able to do. I also want to get the text under "Synopsis" and "Risk Factor".



<h2 xmlns="" class="classsection4" id="idp201558400">50044 (1) - Ubuntu
6.06 LTS / 8.04 LTS / 9.04 / 9.10 / 10.04 LTS / 10.10 : linux,
linux-ec2, linux-source-2.6.15 vulnerabilities (USN-1000-1)</h2>
<h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->Synopsis</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">The remote Ubuntu host is missing one or more security-related patches.</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->Description</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">This is some description text.
(CVE-2010-NNN2).</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->Solution</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">Update the affected packages.</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->Risk Factor</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">Critical</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->CVSS Base Score</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">10.0 (CVSS2#AV:N/AC:L/Au:N/C:C/I:C/A:C)</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->CVSS Temporal Score</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">8.7 (CVSS2#E:ND/RL:OF/RC:ND)</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">





Here is my current code:

import requests
from bs4 import BeautifulSoup
import urllib
import re

page = open("C:/Users/AlphaWP/Downloads/631_SupportingFiles4_Labs6-7/Nessus Vulnerability Scan.htm").read()

soup = BeautifulSoup(page, "html.parser")

for section in soup.findAll("h2",{"class":"classsection4"}):
    # nextNode = section
    # print(nextNode.name)
    # print(section)
    print(section.contents)
    print("##############################")
    # print(section.contents)
for section1 in soup.findAll('h2', text=re.compile(r'Risk')):
    print(section1)
    riskFactor = section1.find("span")
    riskLevel = riskFactor.contents
    print(riskLevel)
    print("##############################")

Answer

To get all the span elements use:

spans = soup.find_all('span', {'class': 'classtext'})

spans is now a list of every span element with class classtext, in document order. For the first repetition shown above, the Synopsis span is at index 0 and the Risk Factor span is at index 3:

>>> spans[0]
<span class="classtext" style="color: #263645; font-weight: normal;" xmlns="">The remote Ubuntu host is missing one or more security-related patches.</span>
>>> spans[3]
<span class="classtext" style="color: #263645; font-weight: normal;" xmlns="">Critical</span>
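
Fixed indexes only hold while every repetition has the same fields in the same order. A less position-dependent option is to match each heading by its text and then take the span that follows it. Two details of the question's code matter here. First, each span is a sibling of its h2, not a child, so section1.find("span") returns None. Second, each h2 contains a comment as well as its label, so its .string is None, and a text=re.compile(...) filter on the tag will most likely not match it. The sketch below works around both with get_text() and find_next_sibling(). SAMPLE_HTML is a shortened, hypothetical stand-in for the report, and extract_fields is an illustrative helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for one repetition of the report.
SAMPLE_HTML = """
<h2 class="classsection4">50044 (1) - Ubuntu vulnerabilities (USN-1000-1)</h2>
<h2 class="classh1 "><!--[if mso]><![endif]-->Synopsis</h2>
<span class="classtext">The remote Ubuntu host is missing one or more security-related patches.</span>
<h2 class="classh1 "><!--[if mso]><![endif]-->Risk Factor</h2>
<span class="classtext">Critical</span>
"""


def extract_fields(soup, wanted=("Synopsis", "Risk Factor")):
    """Return one dict per section: its title plus each wanted field."""
    results = []
    current = None
    for heading in soup.find_all("h2"):
        # get_text() skips comments, so only the visible label remains.
        label = heading.get_text(strip=True)
        if "classsection4" in heading.get("class", []):
            # A section title starts a new record; collapse its line breaks.
            current = {"title": " ".join(label.split())}
            results.append(current)
        elif current is not None and label in wanted:
            # The value lives in the next sibling span, not inside the h2.
            value = heading.find_next_sibling("span", class_="classtext")
            if value is not None:
                current[label] = value.get_text(strip=True)
    return results


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for row in extract_fields(soup):
    print(row)
```

This assumes the headings and spans are siblings under the same parent, as they are in the HTML from the question. If a heading had no span of its own, find_next_sibling would pick up a later one, so the real report may need an extra check.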