jake wong - 4 months ago
Python Question

python selenium scraping tbody

Below is the HTML that I'm trying to scrape:

<div class="data-point-container section-break">
# some other HTML div classes here which I don't need
<table class data-bind="showHidden: isData">
<!-- ko foreach : sections -->
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<!-- /ko -->
</table>
</div>


How do I use pandas.read_html to scrape all of this information, with each thead as the headers and the tbody that follows it as the values?

EDIT:

This is the site that I'm trying to scrape; I want the data extracted into a pandas DataFrame. Link here

Answer

Strictly speaking, one should not have more than one thead element per table according to the table element specification.

If you are stuck with this structure, where each thead is followed by its corresponding tbody, I would parse it iteratively, turning every thead/tbody pair into its own DataFrame.

Working example:

import io

import pandas as pd
from bs4 import BeautifulSoup

data = """
<div class="data-point-container section-break">
    <table class data-bind="showHidden: isData">

        <thead>
            <tr><th>Customer</th><th>Order</th><th>Month</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 1</td><td>#1</td><td>January</td></tr>
            <tr><td>Customer 2</td><td>#2</td><td>April</td></tr>
            <tr><td>Customer 3</td><td>#3</td><td>March</td></tr>
        </tbody>

        <thead>
            <tr><th>Customer</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 4</td></tr>
            <tr><td>Customer 5</td></tr>
            <tr><td>Customer 6</td></tr>
        </tbody>

    </table>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
for thead in soup.select(".data-point-container table thead"):
    # Pair each thead with the tbody that follows it.
    tbody = thead.find_next_sibling("tbody")

    # Wrap the pair in its own table so read_html sees one well-formed table.
    table = "<table>%s</table>" % (str(thead) + str(tbody))

    # Recent pandas versions expect a file-like object, not a raw HTML string.
    df = pd.read_html(io.StringIO(table))[0]
    print(df)
    print("-----")

This prints two DataFrames, one for every thead/tbody pair in the sample input HTML:

     Customer Order    Month
0  Customer 1    #1  January
1  Customer 2    #2    April
2  Customer 3    #3    March
-----
     Customer
0  Customer 4
1  Customer 5
2  Customer 6
-----

Note that I've intentionally made the number of header and data cells different in every block for demonstration purposes.
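
To reuse this on a real page, the loop can be wrapped in a function that returns the DataFrames instead of printing them. The following is only a minimal sketch under stated assumptions: the function name tables_from_sections and its default selector are made up for illustration, and it builds each frame directly from the cell text instead of going through pd.read_html, so it does not need lxml or html5lib to be installed.

```python
import pandas as pd
from bs4 import BeautifulSoup


def tables_from_sections(html, selector=".data-point-container table thead"):
    """Return one DataFrame per thead/tbody pair found in `html`.

    Hypothetical helper: the name and the default selector are illustrative.
    """
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for thead in soup.select(selector):
        tbody = thead.find_next_sibling("tbody")
        if tbody is None:
            # No tbody follows this thead in the table, so there is no data.
            continue
        headers = [th.get_text(strip=True) for th in thead.find_all("th")]
        rows = [
            [td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in tbody.find_all("tr")
        ]
        # Assumes every row has as many cells as there are header cells;
        # pandas raises an error otherwise.
        frames.append(pd.DataFrame(rows, columns=headers))
    return frames


# Small sample in the same shape as the markup from the question.
sample = """
<div class="data-point-container section-break">
  <table>
    <thead><tr><th>Customer</th><th>Order</th><th>Month</th></tr></thead>
    <tbody>
      <tr><td>Customer 1</td><td>#1</td><td>January</td></tr>
      <tr><td>Customer 2</td><td>#2</td><td>April</td></tr>
      <tr><td>Customer 3</td><td>#3</td><td>March</td></tr>
    </tbody>
    <thead><tr><th>Customer</th></tr></thead>
    <tbody>
      <tr><td>Customer 4</td></tr>
      <tr><td>Customer 5</td></tr>
      <tr><td>Customer 6</td></tr>
    </tbody>
  </table>
</div>
"""

frames = tables_from_sections(sample)
for frame in frames:
    print(frame)
    print("-----")
```

With Selenium, the rendered markup could be passed in as tables_from_sections(driver.page_source), assuming driver is a WebDriver that has already loaded the page and the data-bound sections have finished rendering.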