Carto_ Carto_ - 1 month ago 18
HTML Question

BeautifulSoup + xlwt: Put the content of an HTML table in Excel

I am trying, with a small Python script, to put the content of an HTML table from an online web page into an Excel sheet.

Everything works well, except the Excel part.

#!/usr/bin/python
# --*-- coding:UTF-8 --*--

import xlwt
from urllib2 import urlopen
import sys
import re
from bs4 import BeautifulSoup as soup
import urllib

def BULATS_IA(name_excel):
    """ Function for fetching the BULATS AGENTS GLOBAL LIST"""

    ws = wb.add_sheet("BULATS_IA") # I add a sheet in my excel file

    Countries_List = ['United Kingdom','Albania','Andorra']
    Longueur = len(Countries_List)
    number = 1


    print("Starting to fetch ...")

    for Countries in Countries_List:
        x = 0
        y = 0

        print("Fetching country %s of %s" % (number, Longueur))
        number = number + 1
        htmlSource = urllib.urlopen("http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % (Countries)).read()
        s = soup(htmlSource)
        tableauGood = s.findAll('table')
        try:
            rows = tableauGood[3].findAll('tr')
            for tr in rows:
                cols = tr.findAll('td')
                y = 0
                x = x + 1
                for td in cols:
                    hum = td.text

                    ws.write(x,y,td.text)
                    y = y + 1
                    wb.save("%s.xls" % name_excel)

        except (IndexError):
            pass

    print("Finished for IA")



name_doc_out = raw_input("What do you want for name for the Excel output document ? >>> ")
wb = xlwt.Workbook(encoding='utf-8')
print("Starting with BULATS Agents, then with BULATS IA")
#BULATS_AGENTS(name_doc_out)
BULATS_IA(name_doc_out)


--
So nothing ends up in the Excel sheet, but when I print the content of the variable ... I see what I should see!

I have been trying to fix this for an hour, but I still don't understand what is going on.
If someone could give me a hand, it would be very much appreciated.

Answer

I have tried your application, and I am quite sure that the output of td.text is the same as what ends up in the Excel file. So what exactly is your question? If the content is not what you want, you should check how you are using BeautifulSoup. Furthermore, you may need to do the following:

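    for td in cols:
        # Assumption: the first argument was a non-breaking space (&nbsp;) before the page was extracted
        hum = td.text.replace(u"\xa0", u" ")
        print(hum)
        ws.write(x, y, hum)
        y = y + 1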
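
As a rough sketch of how the writing step could be organised so that each table continues below the previous one, here is one possible layout. The names `clean_cell`, `write_rows` and `RecordingSheet` are made up for illustration and are not part of xlwt or BeautifulSoup; `RecordingSheet` only stands in for a worksheet so the example runs without any third-party package. It assumes the cell texts have already been extracted as unicode strings.

```python
# -*- coding: utf-8 -*-
# Sketch only: clean_cell, write_rows and RecordingSheet are hypothetical helpers.


def clean_cell(text):
    """Replace non-breaking spaces with plain spaces and trim the result."""
    return text.replace(u"\xa0", u" ").strip()


def write_rows(sheet, rows, start_row=0):
    """Write rows (lists of cell strings) to sheet, starting at start_row.

    sheet is any object with a write(row, col, value) method, such as an
    xlwt worksheet. Returns the index of the next free row, so a later call
    can continue below the rows already written.
    """
    row_index = start_row
    for cells in rows:
        if not cells:  # e.g. a <tr> that contains no <td> cells
            continue
        for col_index, text in enumerate(cells):
            sheet.write(row_index, col_index, clean_cell(text))
        row_index += 1
    return row_index


class RecordingSheet(object):
    """Stand-in for a worksheet that just remembers what was written."""

    def __init__(self):
        self.cells = {}

    def write(self, row, col, value):
        self.cells[(row, col)] = value


sheet = RecordingSheet()
# Made-up sample data standing in for two scraped tables.
first_table = [[u"Centre\xa0A", u"London"], [], [u"Centre B", u"Leeds"]]
second_table = [[u"Centre C", u"Tirana"]]

next_row = write_rows(sheet, first_table)
next_row = write_rows(sheet, second_table, start_row=next_row)

print(sheet.cells[(0, 0)])  # Centre A
print(next_row)             # 3
```

With the script from the question, `sheet` would be the `ws` returned by `wb.add_sheet(...)`, each `rows` list could be built with something like `[[td.text for td in tr.findAll('td')] for tr in tableauGood[3].findAll('tr')]`, and `wb.save(...)` would only need to be called once, after the loop over the countries.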