susdu susdu - 4 months ago 13
Python Question

Python BeautifulSoup table scraping

My HTML has several tables. The first table is:

<table>
<tr>
<td>
<div id="string">
</div>
</td>
</tr>
</table>


and the rest are of the form:

<table class="confluenceTable" data-csvtable="1">
<tbody>
<tr>
<th class="highlight-grey confluenceTh" data-highlight-colour="grey" rowspan="2" style="text-align: center;">Negev</th>


I want to scrape data from the tables. When I use:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'XXX'
soup = BeautifulSoup(urlopen(url).read(), "lxml")
for table in soup.findAll('table'):
    print(table)


it only finds the first table. When I change the search to:

soup.findAll("table", { "class" : "confluenceTable" })


it doesn't find anything. What am I missing?

I am using Python 3.4 on Windows with BeautifulSoup 4.5.

Answer

I suspect you are trying to scrape an Atlassian Confluence page. These pages are usually quite dynamic and rely heavily on JavaScript to load their content, so the tables are probably added in the browser after the initial HTML arrives. If you look at the HTML source you download with urllib, you will most likely not find any table elements with the confluenceTable class.
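One way to confirm this is to count the tables in the raw HTML before doing anything else. The following is a minimal sketch: `count_tables` is a hypothetical helper name, the sample string only mimics the first table from the question, and the built-in "html.parser" is used instead of "lxml" so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup


def count_tables(html):
    """Return (number of all tables, number of confluenceTable tables)."""
    soup = BeautifulSoup(html, "html.parser")
    all_tables = soup.find_all("table")
    confluence_tables = soup.find_all("table", class_="confluenceTable")
    return len(all_tables), len(confluence_tables)


# Stand-in for what urllib might return: only the static first table,
# because the Confluence tables have not been rendered yet.
static_html = """
<table>
<tr>
<td>
<div id="string">
</div>
</td>
</tr>
</table>
"""

print(count_tables(static_html))  # (1, 0)
```

If you get the same kind of result for `urlopen(url).read()`, the selector is not the problem; the tables are simply absent from the downloaded source.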

Instead, you should either look into using the Confluence API or use a browser automation tool such as Selenium, which loads the page in a real browser so that the JavaScript runs before you read the HTML.
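A rough sketch of the Selenium route could look like the code below. It makes several assumptions: Selenium and a Firefox driver are installed, the page needs no login, and the tables appear within the timeout. The helper names `fetch_rendered_html` and `extract_confluence_tables` are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_confluence_tables(html):
    """Return each confluenceTable as a list of rows (lists of cell texts)."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for table in soup.find_all("table", class_="confluenceTable"):
        rows = []
        for tr in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            if cells:
                rows.append(cells)
        tables.append(rows)
    return tables


def fetch_rendered_html(url, timeout=20):
    """Load the page in a browser and return the HTML after JavaScript has run."""
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        # Wait until at least one Confluence table is present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table.confluenceTable"))
        )
        return driver.page_source
    finally:
        driver.quit()


# Usage, with the placeholder URL from the question:
# html = fetch_rendered_html("XXX")
# for rows in extract_confluence_tables(html):
#     print(rows)
```

Keeping the parsing separate from the browser step means you can test the extraction on a saved copy of the rendered page without starting a browser each time.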
