AD233 - 1 year ago
HTML Question

BeautifulSoup HTML table parsing: only able to get the last row?

I have a simple HTML table to parse, but BeautifulSoup only gives me results from the last row. Could someone take a look and tell me what's wrong? This is the HTML table I created the rows object from:

<table class='participants-table'>
<thead>
<tr>
<th data-field="name" class="sort-direction-toggle name">Name</th>
<th data-field="type" class="sort-direction-toggle type active-sort asc">Type</th>
<th data-field="sector" class="sort-direction-toggle sector">Sector</th>
<th data-field="country" class="sort-direction-toggle country">Country</th>
<th data-field="joined_on" class="sort-direction-toggle joined-on">Joined On</th>
</tr>
</thead>
<tr>
<th class='name'><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
<td class='type'>Company</td>
<td class='sector'>General Industrials</td>
<td class='country'>Netherlands</td>
<td class='joined-on'>2000-09-20</td>
</tr>
<tr>
<th class='name'><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
<td class='type'>Company</td>
<td class='sector'>Pharmaceuticals &amp; Biotechnology</td>
<td class='country'>Portugal</td>
<td class='joined-on'>2004-02-19</td>
</tr>
</table>

I use the following code to get the rows:

table = soup.find_all("table", class_="participants-table")
rows = table[0].find_all("tr")[1:]

This gets:

[<tr>
<th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
<td class="type">Company</td>
<td class="sector">General Industrials</td>
<td class="country">Netherlands</td>
<td class="joined-on">2000-09-20</td>
</tr>, <tr>
<th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
<td class="type">Company</td>
<td class="sector">Pharmaceuticals &amp; Biotechnology</td>
<td class="country">Portugal</td>
<td class="joined-on">2004-02-19</td>
</tr>]

That looks as expected. However, if I continue:

for row in rows:
    cells = row.find_all('th')

I'm only able to get the last entry!

cells=[<th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]

What is going on? This is my first time using BeautifulSoup, and what I'd like to do is export this table to CSV. Any help is greatly appreciated! Thanks

Answer Source

If you want all the th tags in a single list, you need to extend that list. Your loop keeps reassigning cells = row.find_all('th'), so when you print cells outside the loop you only see what it was last assigned to, i.e. the th in the last tr:

cells = []
for row in rows:
    # add this row's th tags to the list instead of replacing it
    cells.extend(row.find_all('th'))

Also, since there is only one table, you can just use find:

soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", class_="participants-table")
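
To see the difference between reassigning and extending in isolation, here is a minimal sketch. It assumes a trimmed-down copy of the question's table stored in a hypothetical `sample_html` string, and the built-in `html.parser` backend; the variable names `last_only` and `all_cells` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the table in the question (assumed for illustration).
sample_html = """
<table class="participants-table">
<thead><tr><th>Name</th><th>Type</th></tr></thead>
<tr><th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th><td class="type">Company</td></tr>
<tr><th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th><td class="type">Company</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find returns the first matching tag directly, so no [0] indexing is needed.
table = soup.find("table", class_="participants-table")
rows = table.find_all("tr")[1:]  # drop the header row

# Reassigning: every iteration overwrites the previous result.
for row in rows:
    last_only = row.find_all("th")

# Extending: every iteration appends to the same list.
all_cells = []
for row in rows:
    all_cells.extend(row.find_all("th"))

print([th.get_text() for th in last_only])  # ['Groupe Bial']
print([th.get_text() for th in all_cells])  # ['Grontmij', 'Groupe Bial']
```

After the first loop, `last_only` holds only the th from the final row, which is the behaviour described in the question.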

If you want to skip the thead row, you can use a CSS selector:

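from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# every tr that is a later sibling of thead, i.e. the data rows only
rows = soup.select("table.participants-table thead ~ tr")

cells = [tr.th for tr in rows]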

cells will give you:

[<th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>, <th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>]
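
The sibling selector can be checked on its own with a small sketch. It assumes a hypothetical `sample_html` string in which the data rows are direct siblings of thead (no tbody wrapper) and the built-in `html.parser`; a parser that inserts tbody, as html5lib does to my knowledge, would likely leave `thead ~ tr` with nothing to match.

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumed): data rows sit next to <thead>, with no <tbody>.
sample_html = """
<table class="participants-table">
<thead><tr><th>Name</th><th>Type</th></tr></thead>
<tr><th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th><td class="type">Company</td></tr>
<tr><th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th><td class="type">Company</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Descendant selector: matches the header row as well as the data rows.
all_rows = soup.select("table.participants-table tr")

# General sibling combinator: only tr elements that follow thead at the same level.
body_rows = soup.select("table.participants-table thead ~ tr")

names = [tr.th.get_text() for tr in body_rows]
print(len(all_rows), len(body_rows))  # 3 2
print(names)                          # ['Grontmij', 'Groupe Bial']
```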

To write the whole table to CSV:

import csv

soup = BeautifulSoup(html, "html.parser")

rows = soup.select("table.participants-table tr")

with open("data.csv", "w") as out:
    wr = csv.writer(out)
    wr.writerow([th.text for th in rows[0].find_all("th")] + ["URL"])

    for row in rows[1:]:
        # one value per th/td cell, plus the link target from the row's anchor
        wr.writerow([tag.text for tag in row.find_all(["th", "td"])] + [row.a["href"]])

which for your sample will give you:

Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial
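
For trying the export without touching the filesystem, here is a hedged, self-contained variant. The helper name `table_to_csv` and the `sample_html` string are assumptions made for this sketch; it writes to an in-memory buffer and restricts the cell lookup to th and td so the nested a tag is not counted as an extra column.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for the question's table (assumed structure: thead followed by plain tr rows).
sample_html = """
<table class="participants-table">
<thead><tr>
<th>Name</th><th>Type</th><th>Sector</th><th>Country</th><th>Joined On</th>
</tr></thead>
<tr>
<th class="name"><a href="/what-is-gc/participants/4479-Grontmij">Grontmij</a></th>
<td class="type">Company</td>
<td class="sector">General Industrials</td>
<td class="country">Netherlands</td>
<td class="joined-on">2000-09-20</td>
</tr>
<tr>
<th class="name"><a href="/what-is-gc/participants/4492-Groupe-Bial">Groupe Bial</a></th>
<td class="type">Company</td>
<td class="sector">Pharmaceuticals &amp; Biotechnology</td>
<td class="country">Portugal</td>
<td class="joined-on">2004-02-19</td>
</tr>
</table>
"""


def table_to_csv(html):
    """Hypothetical helper: return the participants table as CSV text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("table.participants-table tr")

    buf = io.StringIO()
    wr = csv.writer(buf, lineterminator="\n")

    # Header row: the column titles plus an extra URL column.
    wr.writerow([th.get_text(strip=True) for th in rows[0].find_all("th")] + ["URL"])

    # Data rows: one value per th/td cell, then the href of the row's link.
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        wr.writerow(cells + [row.a["href"]])

    return buf.getvalue()


csv_text = table_to_csv(sample_html)
print(csv_text)
```

When writing to a real file in Python 3, the csv module documentation recommends opening it with `newline=""` so that no extra blank lines appear on Windows.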