dexafree dexafree - 1 year ago 77
Python Question

Parsing HTML with BeautifulSoup in Python

I am trying to parse HTML with Python using BeautifulSoup, but I can't get the data I need.

This is a small module of a personal app I am building. It logs in to a website with credentials, and once the script is logged in, I need to parse some information from the page in order to process it.

The HTML returned after logging in is:

<div class="widget_title clearfix">
    <h2>Account Balance</h2>
</div>
<div class="widget_body">
    <div class="widget_content">
        <table class="simple">
            <tr>
                <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td>
                <td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
                    150
                </td>
            </tr>
            <tr>
                <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td>
                <td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
                    500 </td>
            </tr>
            <tr>
                <td><a href="#" id="west3" title="Total Monthly earnings">Monthly Earnings</a></td>
                <td style="text-align: right; color: #119911; font-weight: bold;">
                    1500 </td>
            </tr>
            <tr>
                <td><a href="#" id="west4" title="Total expenses">Total expended</a></td>
                <td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
                    430 </td>
            </tr>
            <tr>
                <td><a href="#" id="west5" title="Total available">Account Balance</a></td>
                <td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
                    840 </td>
            </tr>
            <tr>
                <td></td>
                <td style="padding: 5px;">
                    <center>
                        <form id="request_bill" method="POST" action="index.php?page=dashboard">
                            <input type="hidden" name="secret_token" value="" />
                            <input type="hidden" name="request_payout" value="1" />
                            <input type="submit" class="btn blue large" value="Request Payout" />
                        </form>
                    </center>
                </td>
            </tr>
        </table>
    </div>
</div>
</div>


As you can see, the HTML is not very well formatted, but I need to extract each label together with its value, for example "Daily Earnings" and "150", "Weekly Earnings" and "500", and so on.

I think the "id" attribute may help, but when I try to use it in the search, the script crashes.

The Python code I'm working with is:

def parseo(archivohtml):
    html = archivohtml
    parsed_html = BeautifulSoup(html)
    par = parsed_html.find('td', attrs={'id':'west1'}).string
    print par


Here archivohtml is the HTML file saved after logging in to the site.

When I run the script, I only get errors.

I've also tried doing this:

def parseo(archivohtml):
    soup = BeautifulSoup()
    html = archivohtml
    parsed_html = soup(html)
    par = soup.parsed_html.find('td', attrs={'id':'west1'}).string
    print par


But the result is still the same.

Answer Source

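The tag with id="west1" is an <a> tag, not a <td>, so find('td', attrs={'id':'west1'}) matches nothing and returns None; accessing .string on None then raises an AttributeError. What you are looking for is the <td> tag that comes after this <a> tag: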

import BeautifulSoup as bs

content = '''<div class="widget_title clearfix">
        <h2>Account Balance</h2>
    </div>
    <div class="widget_body">
        <div class="widget_content">
            <table class="simple">
                <tr>
                    <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td>
                    <td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
                        150                         
                    </td>
                </tr>
                <tr>
                    <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td>
                    <td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
                        500                     </td>
                </tr>
                <tr>
                    <td><a href="#" id="west3" title="Total Monthly earnings">Monthly Earnings</a></td>
                    <td style="text-align: right; color: #119911; font-weight: bold;">
                        1500                        </td>
                </tr>
                <tr>
                    <td><a href="#" id="west4" title="Total expenses">Total expended</a></td>
                    <td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
                        430                     </td>
                </tr>
                <tr>
                    <td><a href="#" id="west5" title="Total available">Account Balance</a></td>
                    <td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
                        840                     </td>
                </tr>
                <tr>
                    <td></td>
                    <td style="padding: 5px;">
                        <center>
                            <form id="request_bill" method="POST" action="index.php?page=dashboard">
                                <input type="hidden" name="secret_token" value="" />
                                <input type="hidden" name="request_payout" value="1" />
                                <input type="submit" class="btn blue large" value="Request Payout" />
                            </form>
                        </center>
                    </td>
                </tr>
            </table>
        </div>
    </div>
</div>'''

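def parseo(archivohtml):
    html = archivohtml
    parsed_html = bs.BeautifulSoup(html)
    # The id is on the <a> tag, so find that first, then move on to the
    # next <td> in the document, which holds the value.
    par = parsed_html.find('a', attrs={'id':'west1'}).findNext('td')
    # .string includes the whitespace around the number; strip() removes it.
    print par.string.strip()

parseo(content)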

yields

150
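The question asks for every label together with its value, not just the first one. The following is a minimal sketch of how the same idea might be extended to all rows. It assumes Python 3 and BeautifulSoup 4 (the bs4 package) rather than the BeautifulSoup 3 import used above. The names extract_earnings and SAMPLE are hypothetical, and SAMPLE is a shortened copy of the HTML from the question with the style attributes trimmed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (assumption: the real page
# has the same row layout: a label link in one cell, the value in the next).
SAMPLE = """
<table class="simple">
  <tr>
    <td><a href="#" id="west1" title="Total earned daily">Daily Earnings</a></td>
    <td style="text-align: right;">
      150
    </td>
  </tr>
  <tr>
    <td><a href="#" id="west2" title="Total weekly earnings">Weekly Earnings</a></td>
    <td style="text-align: right;">
      500 </td>
  </tr>
  <tr>
    <td></td>
    <td><form id="request_bill"><input type="submit" value="Request Payout" /></form></td>
  </tr>
</table>
"""


def extract_earnings(html):
    """Return (label, value) pairs for each row that contains a link with an id."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.find_all("tr"):
        # The label lives in an <a> tag that carries an id (west1, west2, ...).
        link = row.find("a", id=True)
        if link is None:
            # Rows without such a link (e.g. the payout form) are skipped.
            continue
        # The value is in the next <td> after the link, in document order.
        value_cell = link.find_next("td")
        pairs.append((link.get_text(strip=True), value_cell.get_text(strip=True)))
    return pairs


for label, value in extract_earnings(SAMPLE):
    print(label, value)
```

If archivohtml is a path to the saved file rather than its contents, the file would need to be opened first and either its text or the open file object passed to BeautifulSoup; passing the path string itself would parse the path as markup.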