MikeD - 14 days ago
Python Question

Parsing webpage with robobrowser and beautifulsoup

I'm new to web scraping and am trying to parse a website after submitting a form with robobrowser. I get the correct data back (I can see it when I run print(browser.parsed)), but I'm having trouble parsing it. The relevant part of the page source looks like this:

<div id="ii">
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 &nbsp;</td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN &nbsp;</td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE &nbsp;</td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 &nbsp;</td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White &nbsp;</td>
</tr>


If I do

in: browser.select('#t1b')


I get:

out: [<td id="t1b" scope="row">Participant Name</td>]


instead of JONES, JOHN.

The only way I've been able to get the relevant data is doing:

browser.select('tr')


This returns a list of all 29 'tr' elements, each of which I can convert to text and search for the relevant info.

I've also tried creating a BeautifulSoup object:

x = browser.select('#ii')
soup = BeautifulSoup(x[0].text, "html.parser")


but that loses all the tags and ids, so I can't figure out how to search within it.

Is there an easy way to loop through each 'tr' element and get the actual data rather than the label, as opposed to repeatedly converting each one to a string and searching through it?

Thanks

Answer

Select all the "label" td elements, take the text of each one's next td sibling, and collect the results into a dict:

from pprint import pprint
from bs4 import BeautifulSoup

data = """
<table>
    <tr>
      <td scope="row" id="t1a"> ID (ID Number)</a></td>
      <td headers="t1a">1234567 &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1b">Participant Name</td>
      <td headers="t1b">JONES, JOHN                          &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1c">Sex</td>
      <td headers="t1c">MALE   &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1d">Date of Birth</td>
      <td headers="t1d">11/25/2016 &nbsp;</td>
    </tr>
    <tr>
      <td scope="row" id="t1e">Race / Ethnicity</a></td>
      <td headers="t1e">White                  &nbsp;</td>
    </tr>
</table>
"""

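# 'html5lib' has to be installed separately (pip install html5lib)
soup = BeautifulSoup(data, 'html5lib')

# Map each label cell's text to the text of the td that follows it in the same row
result = {
    label.get_text(strip=True): label.find_next_sibling("td").get_text(strip=True)
    for label in soup.select("tr > td[scope=row]")
}
pprint(result)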

Prints:

{'Date of Birth': '11/25/2016',
 'ID (ID Number)': '1234567',
 'Participant Name': 'JONES, JOHN',
 'Race / Ethnicity': 'White',
 'Sex': 'MALE'}
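
If you only need one value, such as the name, you can probably skip the dict and select the value cell directly: in the posted markup each value td carries a headers attribute that names its label's id. Below is a minimal sketch of that idea. The helper name value_for is made up for illustration, the HTML is a trimmed copy of the sample above, and I use the built-in "html.parser" here only so the snippet needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the question
html = """
<table>
  <tr>
    <td scope="row" id="t1b">Participant Name</td>
    <td headers="t1b">JONES, JOHN &nbsp;</td>
  </tr>
  <tr>
    <td scope="row" id="t1c">Sex</td>
    <td headers="t1c">MALE &nbsp;</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def value_for(soup, label_id):
    """Hypothetical helper: text of the td whose headers attribute equals label_id."""
    cell = soup.select_one('td[headers="{}"]'.format(label_id))
    # select_one returns None when nothing matches
    return cell.get_text(strip=True) if cell is not None else None


name = value_for(soup, "t1b")
print(name)
```

This explains what you saw with '#t1b': that id belongs to the label cell, while the value lives in the neighbouring td with headers="t1b".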

Note: this is also covered in the BeautifulSoup StackOverflow documentation.
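
As for the BeautifulSoup(x[0].text, "html.parser") attempt in the question: .text is a plain string with the markup already removed, so re-parsing it leaves nothing to select. Judging by the output shown in the question, browser.select() already returns ordinary BeautifulSoup Tag objects, and those can be searched directly. The sketch below illustrates the difference on a small stand-in for the page; the soup variable plays the role of browser.parsed, which is an assumption since robobrowser itself is not used here.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="ii"><table><tr>'
    '<td scope="row" id="t1b">Participant Name</td>'
    '<td headers="t1b">JONES, JOHN &nbsp;</td>'
    '</tr></table></div>'
)

# Stand-in for browser.parsed (assumed to be a BeautifulSoup object)
soup = BeautifulSoup(html, "html.parser")
container = soup.select("#ii")[0]

# Re-parsing .text: the string contains no tags, so nothing matches
flattened = BeautifulSoup(container.text, "html.parser")
lost = flattened.select("td")

# Searching the Tag itself keeps the structure intact
kept = container.select("td[headers]")
values = [td.get_text(strip=True) for td in kept]

print(lost)
print(values)
```

If you do need a separate soup for the fragment, str(x[0]) keeps the markup where x[0].text does not.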