oktober_1989_24 - 1 month ago
HTML Question

BeautifulSoup: parsing an HTML table, especially colspan and rowspan

I want to ask about parsing the values of colspan and rowspan from a <table>, for example like this:

<table cellpadding="2" cellspacing="2" border="1" width="50%">
<tbody>
<tr>
<td valign="top" rowspan="2" colspan="1" align="center">NO<br>
</td>
<td valign="top" rowspan="1" colspan="3" align="center">NAMA<br>
</td>
<td valign="top" rowspan="1" colspan="2" align="center">TELEPON<br>
</td>
<td valign="top" rowspan="2" colspan="1" align="center">KODE<br>
</td>
</tr>
<tr>
<td valign="top" align="center">DEPAN<br>
</td>
<td valign="top" align="center">TENGAH<br>
</td>
<td valign="top" align="center">BELAKANG<br>
</td>
<td valign="top" align="center">KODE<br>
</td>
<td valign="top" align="center">NO TLP<br>
</td>
</tr>
<tr>
<td valign="top" align="center">1<br>
</td>
<td valign="top">Ani<br>
</td>
<td valign="top">Tiara<br>
</td>
<td valign="top">Ramadika<br>
</td>
<td valign="top" align="center">021<br>
</td>
<td valign="top" align="center">8466729<br>
</td>
<td valign="top" align="center">17412<br>
</td>
</tr>
<tr>
<td valign="top" align="center">2<br>
</td>
<td valign="top">Dia<br>
</td>
<td valign="top">Andari<br>
</td>
<td valign="top">Putri<br>
</td>
<td valign="top" align="center">022<br>
</td>
<td valign="top" align="center">5930290<br>
</td>
<td valign="top" align="center">18291<br>
</td>
</tr>
<tr>
<td valign="top" align="center">3<br>
</td>
<td valign="top">Rangga<br>
</td>
<td valign="top">Dimas<br>
</td>
<td valign="top">Putra<br>
</td>
<td valign="top" align="center">023<br>
</td>
<td valign="top" align="center">8349829<br>
</td>
<td valign="top" align="center">13901<br>
</td>
</tr>
<tr>
<td valign="top" align="center">4<br>
</td>
<td valign="top">Niko<br>
</td>
<td valign="top">Reza<br>
</td>
<td valign="top">Anggara<br>
</td>
<td valign="top" align="center">024<br>
</td>
<td valign="top" align="center">4284982<br>
</td>
<td valign="top" align="center">21211<br>
</td>
</tr>

</tbody>
</table>


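I use Python for HTML parsing, like this:

from bs4 import BeautifulSoup

# html holds the table markup shown above
soup = BeautifulSoup(html, "html.parser")
t = soup.find("table")
# for each row, collect the <td> cells that carry a rowspan attribute
dat = [[str(td) for td in row.find_all("td", rowspan=True)] for row in t.find_all("tr")]
print(dat[1])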


But I am still confused about how to get the value of colspan.

I can already parse the table tags, but I don't know how to read the value of the colspan attribute. I tried using a regex, but without success.

Answer

I recommend using CSS selectors:

from bs4 import BeautifulSoup

with open("colspan_rowspan.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# select all td descendants of tr that have both colspan and rowspan
tags = soup.select('tr td[colspan][rowspan]')

# print out the values, for example:
print([(td['colspan'], td['rowspan']) for td in tags])

# prints [('1', '2'), ('3', '1'), ('2', '1'), ('1', '2')]
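
The selector above only matches cells that declare both attributes, and the values come back as strings. If you also need the cells that omit colspan or rowspan (HTML treats a missing span as 1), something like the following might help. It is only a sketch: the helper name cell_spans and the shortened sample markup are my own, not part of the question, and it assumes the table contains no nested tables.

```python
from bs4 import BeautifulSoup

# Shortened copy of the two header rows from the question (sample input only).
html = """
<table>
<tr>
<td rowspan="2" colspan="1">NO</td>
<td rowspan="1" colspan="3">NAMA</td>
<td rowspan="1" colspan="2">TELEPON</td>
<td rowspan="2" colspan="1">KODE</td>
</tr>
<tr>
<td>DEPAN</td><td>TENGAH</td><td>BELAKANG</td><td>KODE</td><td>NO TLP</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def cell_spans(table):
    """Hypothetical helper: per row, a list of (text, rowspan, colspan)."""
    rows = []
    for tr in table.find_all("tr"):
        cells = []
        for td in tr.find_all(["td", "th"]):
            # Tag.get returns the default when the attribute is absent,
            # so cells without an explicit span fall back to 1.
            rowspan = int(td.get("rowspan", 1))
            colspan = int(td.get("colspan", 1))
            cells.append((td.get_text(strip=True), rowspan, colspan))
        rows.append(cells)
    return rows


spans = cell_spans(soup.find("table"))
print(spans[0])  # first header row
print(spans[1])  # second header row, all spans default to 1
```

With integer spans for every cell, you can go on to work out which grid columns each header covers.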