ipmev12 - 4 months ago
HTML Question

Trying to scrape text from HTML that doesn't have any distinctive tags except br (Python 3)

I have been writing a scraping program for my company's websites, but I have run into an issue: I need to scrape text out of an HTML table, and I am having trouble getting the data I need.

HTML CODE



<div>
<table class="style3" cellspacing="0" rules="all" border="1" id="ctl00_cpMainContent_gvNodes" style="border-color:White;border-style:None;width:1090px;border-collapse:collapse;">
<tr>
<th scope="col">History</th>
</tr><tr>
<td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td>
</tr><tr>
<td style="color:White;background-color:Blue;border-color:Black;border-style:Inset;font-size:12pt;font-weight:normal;">date updated: 02/01/2014 21:42:52 | By: jakubkwasny | Status: Resolved</td>
</tr><tr>
<td style="color:Black;background-color:LightSkyBlue;border-color:LightSkyBlue;font-size:12pt;font-weight:normal;"><br />Root Cause: Hardware Failure<br />Action Completed: Power supply/filter/cable swap<br /><br />Arrival Time: 02/01/2014 15:54:17<br />Leaving Time: 02/01/2014 16:27:44<br />Was the job successful: Yes<br /><br /><br />Notes:replaced dsl cable and filter. Also rebooted all equipment. All working fine now.<br />Next Action required:none<br />Added by jakubkwasny at 02/01/2014 21:41:40<br /><br />Pinging 99.99.99.99 with 32 bytes of data:<br />Reply from 99.99.99.99: bytes=32 time=67ms TTL=240<br />Reply from 99.999.999.99: bytes=32 time=92ms TTL=240<br />Reply from 99.99.65.65: bytes=32 time=76ms TTL=240<br />Reply from 67.45.32.12: bytes=32 time=82ms TTL=240<br /><br />Ping statistics for 12.12.12.12:<br />Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),<br />Approximate round trip times in milli-seconds:<br />Minimum = 67ms, Maximum = 92ms, Average = 79ms</td>
</tr><tr>
<td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td>


I need to be able to scrape the data separated by the br tags, such as the data inside the third td tag. I have managed to scrape all the data from the table, but I can't figure out how to get specific rows and then get the text between the br tags.

CODE SNIPPET

bsobjswap = BeautifulSoup(r2.content)
print (bsobjswap.find('table',{'id':'ctl00_cpMainContent_gvNodes'}).find("style",{"color":"Black"}))


This is my latest attempt, but it doesn't work. Any help is appreciated.

MORE DATA

<div id="ctl00_cpMainContent_upNodes">

<div>
<table class="style3" cellspacing="0" rules="all" border="1" id="ctl00_cpMainContent_gvNodes" style="border-color:White;border-style:None;width:1090px;border-collapse:collapse;">
<tr>
<th scope="col">History</th>
</tr><tr>
<td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td>
</tr><tr>
<td style="color:White;background-color:Blue;border-color:Black;border-style:Inset;font-size:12pt;font-weight:normal;">date updated: 02/01/2014 21:21:16 | By: jakubkwasny | Status: Resolved</td>
</tr><tr>
<td style="color:Black;background-color:LightSkyBlue;border-color:LightSkyBlue;font-size:12pt;font-weight:normal;"><br />Root Cause: Core / Authentication issue<br />Action Completed: No site visit required<br /><br />Hi Chris,<br /><br />There were no faults detected. As installation have been done recently, Lancom uses 2.05 configuration script. Our engineer was unable to see landing page, he was getting connected to the Internet with. I contacted Picopoint who informed me that this is due the fact that their system remembers MAC addresses of the devices that were logged into the system hence no landing page is needed. It have been confirmed by removing MAC addresses of the engineer's devices from the database. By doing so engineer was able to access the landing page again. Picopoint's engineer checked the configuration of the devices at both ends and haven't detected any problems. At the moment we are unable to state what are the issues with venue as we haven't experienced any. <br /><br />Arrival Time: 02/01/2014 16:19:23<br />Leaving Time: 02/01/2014 17:51:18<br />Was the job successful: Yes<br /><br /><br />Notes:Still physically missing lines 3 and 4. See screen shot.<br />Line 6 has a dial tone BUT no dsl is present on line.<br />Still getting some landing page errors.. 
My laptop now seems to work but my android phone justs connects to google with no landing page .<br /><br />Screen shots included but couldnt access youtube ( was recieveing an block ID error )<br />ASDA resriction ?<br /><br />Picopoint still looking into problem according to Jakub<br /><br />Next Action required:Ask Jakub<br />Added by jakubkwasny at 02/01/2014 21:10:12<br /><br />Pinging 11.11.11.11 with 32 bytes of data:<br />Reply from 11.11.11.11: bytes=32 time=47ms TTL=50<br />Reply from 11.11.11.11: bytes=32 time=38ms TTL=50<br />Reply from 11.11.11.11: bytes=32 time=39ms TTL=50<br />Reply from 11.11.11.11: bytes=32 time=41ms TTL=50<br /><br />Ping statistics for 11.11.11.11:<br />Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),<br />Approximate round trip times in milli-seconds:<br />Minimum = 38ms, Maximum = 47ms, Average = 41ms</td>
</tr><tr>
<td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td>


My code visits thousands of pages, and as far as I can see each table follows the same pattern. I am guessing I will always need the data from the third td tag, but I am not sure how to get it.

Cheers

Answer

How about this:

from bs4 import BeautifulSoup

html = """(your html from the example above)"""

soup = BeautifulSoup(html, 'html.parser')

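# Find the details cell by matching its full inline style string exactly.
row_data = soup.find('td', {'style':'color:Black;background-color:LightSkyBlue;border-color:LightSkyBlue;font-size:12pt;font-weight:normal;'})

# Drop the wrapping <td> tag so that only the cell contents remain.
clean_data = str(row_data).replace('<td style="color:Black;background-color:LightSkyBlue;border-color:LightSkyBlue;font-size:12pt;font-weight:normal;">','')\
    .replace('</td>','')

# BeautifulSoup serialises <br /> as <br/>, so split on that and skip empty pieces.
print('\n'.join([x for x in clean_data.split('<br/>') if x != '']))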

"""
Generated output:

Root Cause: Hardware Failure
Action Completed: Power supply/filter/cable swap
Arrival Time: 02/01/2014 15:54:17
Leaving Time: 02/01/2014 16:27:44
Was the job successful: Yes
Notes:replaced dsl cable and filter. Also rebooted all equipment. All working fine now.
Next Action required:none
Added by jakubkwasny at 02/01/2014 21:41:40
Pinging 99.99.99.99 with 32 bytes of data:
Reply from 99.99.99.99: bytes=32 time=67ms TTL=240
Reply from 99.999.999.99: bytes=32 time=92ms TTL=240
Reply from 99.99.65.65: bytes=32 time=76ms TTL=240
Reply from 67.45.32.12: bytes=32 time=82ms TTL=240
Ping statistics for 12.12.12.12:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 67ms, Maximum = 92ms, Average = 79ms
"""
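
Since the question mentions that every table follows the same pattern and the wanted data is always in the third td, a possibly less brittle variant is to pick the cell by position and let BeautifulSoup hand back the text pieces between the br tags through `stripped_strings`, instead of matching the whole style string and replacing tags by hand. This is only a sketch: the helper name `details_lines`, the `index=2` default, and the shortened HTML are assumptions made for illustration, and it assumes the table id from the question stays the same across pages.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (assumption: real pages keep
# the same header / spacer / summary / details row order).
html = """
<table id="ctl00_cpMainContent_gvNodes">
<tr><th scope="col">History</th></tr>
<tr><td style="color:White;background-color:White;"> </td></tr>
<tr><td style="color:White;background-color:Blue;">date updated: 02/01/2014 21:42:52 | By: jakubkwasny | Status: Resolved</td></tr>
<tr><td style="color:Black;background-color:LightSkyBlue;"><br />Root Cause: Hardware Failure<br />Action Completed: Power supply/filter/cable swap<br /><br />Was the job successful: Yes</td></tr>
<tr><td style="color:White;background-color:White;"> </td></tr>
</table>
"""


def details_lines(page_html, table_id="ctl00_cpMainContent_gvNodes", index=2):
    """Return the text pieces of the td at position `index` (hypothetical helper)."""
    soup = BeautifulSoup(page_html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []  # page without the expected table
    cells = table.find_all("td")
    if len(cells) <= index:
        return []  # table is shorter than expected
    # stripped_strings yields every non-empty text node with surrounding
    # whitespace removed; the <br> tags themselves contribute nothing.
    return list(cells[index].stripped_strings)


lines = details_lines(html)
print("\n".join(lines))
```

With the shortened HTML above this prints the three lines "Root Cause: Hardware Failure", "Action Completed: Power supply/filter/cable swap" and "Was the job successful: Yes". Selecting by position depends on the row order never changing, so if some pages have extra rows, matching on part of the style (for example the LightSkyBlue background) may be the safer choice.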