Mary Mary - 5 months ago 8
Python Question

Extracting information from a table except header of the table using bs4

I am trying to extract information from a table using bs4 and Python.
When I use the following code to extract information from the header of the table:

tr_header=table.findAll("tr")[0]
tds_in_header = [td.get_text() for td in tr_header.findAll("td")]
header_items= [data.encode('utf-8') for data in tds_in_header]
len_table_header = len (header_items)


it works. But the following code, which is meant to extract information from the first data row to the end of the table:

tr_all=table.findAll("tr")[1:]
tds_all = [td.get_text() for td in tr_all.findAll("td")]
table_info= [data.encode('utf-8') for data in tds_all]


raises the following error:

AttributeError: 'list' object has no attribute 'findAll'


Can anyone help me fix it?

This is table information:

<table class="codes"><tr><td><b>Code</b>
</td><td><b>Display</b></td><td><b>Definition</b></td>
</tr><tr><td>active<a name="active"> </a></td>
<td>Active</td><td>This account is active and may be used.</td></tr>
<tr><td>inactive<a name="inactive"> </a></td>
<td>Inactive</td><td>This account is inactive
and should not be used to track financial information.</td></tr></table>


This is the output for tr_all:

[<tr><td><b>Code</b></td><td><b>Display</b></td><td><b>Definition</b></td></tr>, <tr><td>active<a name="active"> </a></td><td>Active</td><td>This account is active and may be used.</td></tr>, <tr><td>inactive<a name="inactive"> </a></td><td>Inactive</td><td>This account is inactive and should not be used to track financial information.</td></tr>]

Answer

For your first question:

import bs4

text = """
<table class="codes"><tr><td><b>Code</b>
</td><td><b>Display</b></td><td><b>Definition</b></td>
</tr><tr><td>active<a name="active"> </a></td>
<td>Active</td><td>This account is active and may be used.</td></tr>
<tr><td>inactive<a name="inactive"> </a></td>
<td>Inactive</td><td>This account is inactive
 and should not be used to track financial information.</td></tr></table>"""

# name the parser explicitly so bs4 does not have to guess one (and warn about it)
table = bs4.BeautifulSoup(text, "html.parser")
tr_all = table.findAll("tr")[1:]
tds_all = []
for tr in tr_all:
    tds_all.append([td.get_text() for td in tr.findAll("td")])
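# flatten the per-row lists into one list with a nested list comprehension
table_info = [cell.encode('utf-8') for row in tds_all for cell in row]
print(table_info)

yields (on Python 2; Python 3 prints each item with a b prefix)

['active ', 'Active', 'This account is active and may be used.', 'inactive ', 'Inactive', 'This account is inactive\n and should not be used to track financial information.']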

And regarding your second question:

With tr_header=table.findAll("tr")[0] I do not get a list

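True: [0] is an indexing operation, which selects the first element of the list, so you get a single element (a tag) that has a findAll method. [1:] is a slicing operation, which returns a list, and a list has no findAll method, hence the AttributeError (take a look at a slicing tutorial if you need more information).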
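A minimal sketch of that difference, assuming bs4 is installed. The shortened `html` string is a made-up stand-in for the table in the question, and `find_all` is the current spelling of `findAll`:

```python
from bs4 import BeautifulSoup, Tag

# Shortened, illustrative stand-in for the table in the question.
html = ("<table><tr><td>Code</td></tr>"
        "<tr><td>active</td></tr>"
        "<tr><td>inactive</td></tr></table>")
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")   # list-like collection of <tr> tags
first = rows[0]              # indexing -> one Tag
rest = rows[1:]              # slicing  -> a plain list of Tags

first_is_tag = isinstance(first, Tag)
rest_is_list = isinstance(rest, list)
rest_has_find_all = hasattr(rest, "find_all")

# Call find_all on each row, never on the list itself.
cells = [td.get_text() for tr in rest for td in tr.find_all("td")]
print(first_is_tag, rest_is_list, rest_has_find_all, cells)
```

The last comprehension is the shape that avoids the AttributeError: loop over the rows first, then ask each row for its cells.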

Actually, you build the list twice, once per call of table.findAll("tr"): once for the header and once for the rest of the rows. This is redundant. If you want to separate the header cells from the rest, you likely want something like this:

tr_all = table.findAll("tr")
header = tr_all[0]
tr_rest = tr_all[1:] 
tds_rest = []
# ask the header row for its <td> cells instead of iterating over its raw children
header_data = [td.get_text().encode('utf-8') for td in header.findAll("td")]

for tr in tr_rest:
    tds_rest.append([td.get_text() for td in tr.findAll("td")])
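The same idea as a self-contained sketch, assuming bs4 is installed. The helper name `split_table` and the shortened `html` sample are hypothetical, introduced only for illustration; `get_text(strip=True)` is used here to trim surrounding whitespace, which the snippet above does not do:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the table in the question.
html = """<table class="codes"><tr><td><b>Code</b></td><td><b>Display</b></td></tr>
<tr><td>active</td><td>Active</td></tr>
<tr><td>inactive</td><td>Inactive</td></tr></table>"""


def split_table(table_html):
    """Return (header_cells, body_rows) built from every <tr> in table_html."""
    soup = BeautifulSoup(table_html, "html.parser")
    rows = soup.find_all("tr")          # parse the rows only once
    if not rows:
        return [], []
    header = [td.get_text(strip=True) for td in rows[0].find_all("td")]
    body = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in rows[1:]]
    return header, body


header_cells, body_rows = split_table(html)
print(header_cells)
print(body_rows)
```

Keeping one list per row (rather than one flat list) makes it easy to pair each value with its header later, for example with `zip(header_cells, row)`.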

And regarding your third question:

Is it possible to edit this code to add table information from the first row to the end of the table?

Given your desired output in the comments below:

rows_all = table.findAll("tr")
header = rows_all[0]
rows = rows_all[1:]

data = []
for row in rows:
    for td in row:
        try:
            data.append(td.get_text())
        except AttributeError:
            continue
print(data)

# or, more or less the same as above, as a one-liner
data = [td.get_text() for row in rows for td in row.findAll("td")]

yields

[u'active', u'Active', u'This account is active and may be used.', u'inactive', u'Inactive', u'This account is inactive and should not be used to track financial information.']
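Depending on the markup, the extracted text may still carry stray whitespace, such as a trailing space from the empty anchor or a line break inside a definition. A small sketch of normalising it, assuming bs4 is installed; the helper name `clean` is hypothetical and the `html` sample is a shortened copy of the table from the question:

```python
from bs4 import BeautifulSoup

# One data row from the question's table, keeping its awkward whitespace.
html = """<table class="codes"><tr><td><b>Code</b>
</td><td><b>Display</b></td><td><b>Definition</b></td>
</tr><tr><td>inactive<a name="inactive"> </a></td>
<td>Inactive</td><td>This account is inactive
and should not be used to track financial information.</td></tr></table>"""


def clean(text):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    return " ".join(text.split())


soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")[1:]          # skip the header row
data = [clean(td.get_text()) for row in rows for td in row.find_all("td")]
print(data)
```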