Bhavesh Ghodasara - 2 months ago
Python Question

Parse a table using BeautifulSoup in Python

I want to traverse each row of the table and capture the value of td.text for every cell. The problem is that the table has no class, and all the td elements share the same class name. I want to traverse each row and produce the following output:

1st row)"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C" (new line)

2nd row) "AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C" (new line)

<div style="overflow:auto; border:1px #cccccc solid;">
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%">
<tbody>
<tr class="tblHeading">
<td colspan="7">AMERICANS SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Cameron Coya </td>
<td width="19%" class="tdUnderLine">
Rozel, Max
</td>
<td width="06%" class="tdUnderLine">
09-11-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
<a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=228004" target="_blank">228004</a>
</td>
<td width="16%" class="tdUnderLine" align="center">
09/10/16 02:15 PM
</td>
<td width="30%" class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading">
<td colspan="7">AVIATORS SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td>
</tr>
<tr bgcolor="#FBFBFB">
<td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Saskia Reyes </td>
<td width="19%" class="tdUnderLine">
HollaenderNardelli, Eric
</td>
<td width="06%" class="tdUnderLine">
09-11-2016
</td>
<td width="05%" class="tdUnderLine" align="center">

<a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=224463" target="_blank">224463</a>
</td>
<td width="16%" class="tdUnderLine" align="center">
09/11/16 06:45 PM
</td>
<td width="30%" class="tdUnderLine"> player/sub guilty of unsporting behavior </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading">
<td colspan="7">BERGENFIELD SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Christian Latorre </td>
<td width="19%" class="tdUnderLine">
Coyle, Kevin
</td>
<td width="06%" class="tdUnderLine">
09-10-2016
</td>
<td width="05%" class="tdUnderLine" align="center">

<a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=226294" target="_blank">226294</a>
</td>
<td width="16%" class="tdUnderLine" align="center">

09/10/16 11:00 AM

</td>
<td width="30%" class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>


I tried the following code:

import requests
from bs4 import BeautifulSoup
import re
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html"
page = open(url, encoding="utf8")
soup = BeautifulSoup(page.read(),"html.parser")

#tableList = soup.findAll("table")

for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())


But obviously this returns the text from every td, so I cannot tell which column a value belongs to or where a new record starts. I want to know:

1) How to identify each column, given that the class name is the same for all of them and there are heading rows as well (I would appreciate code for that).

2) How to identify the start of a new record in such a structure.

Answer
import re
import datetime
from bs4 import BeautifulSoup

with open("/tmp/a.html", encoding="utf8") as page:
    soup = BeautifulSoup(page.read(), "html.parser")

tableList = soup.find_all("table")

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

for table in tableList:
    # Each record spans three rows: club heading, team heading, detail row.
    for trs in chunks(table.find_all('tr'), 3):
        try:
            # Drop non-ASCII characters such as the &nbsp; padding.
            extracted_text = [re.sub(r'([^\x00-\x7F])+', '', tr.text) for tr in trs]
            extracted_text = [x.strip() for x in ''.join(extracted_text).split('\n')]
            # list() is required on Python 3, where filter() returns an iterator.
            extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
            # Remove the two unwanted fields that follow the player name.
            extracted_text.pop(3)
            extracted_text.pop(3)
            extracted_text[3] = "Player " + extracted_text[3]
            extracted_text[4] = datetime.datetime.strptime(extracted_text[4], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
            extracted_text[-1] = 'C'
            extracted_text = ['"' + x + '"' for x in extracted_text]
            print(','.join(extracted_text))
        except (IndexError, ValueError):
            # Skip groups of rows that do not match the expected layout.
            pass

And when run:

$ python a.py

"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C"
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C"
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C"
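
The code above depends on every record spanning exactly three rows and on the line breaks in the HTML source. As an alternative, here is a minimal sketch that answers the two questions more directly: a row with class tblHeading starts a new club, a row with a single cell is a team heading, and a row with seven cells is a record whose columns are identified by position. The names in COLUMNS, the helpers parse_records and format_row, and taking the first letter of the last cell as the code ("C" for "Cautioned") are my own assumptions, not something the page defines. The sample HTML is a trimmed copy of the table from the question.

```python
from datetime import datetime
from bs4 import BeautifulSoup

# Trimmed copy of the question's table (assumption: real pages follow this layout).
SAMPLE_HTML = """
<table>
<tr class="tblHeading"><td colspan="7">AMERICANS SOCCER CLUB</td></tr>
<tr bgcolor="#CCE4F1"><td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td></tr>
<tr bgcolor="#FFFFFF">
<td class="tdUnderLine"> &nbsp;&nbsp; Cameron Coya </td>
<td class="tdUnderLine">Rozel, Max</td>
<td class="tdUnderLine">09-11-2016</td>
<td class="tdUnderLine"><a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=228004">228004</a></td>
<td class="tdUnderLine">09/10/16 02:15 PM</td>
<td class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
</table>
"""

# Hypothetical column names, assigned by position because every td shares one class.
COLUMNS = ["player", "referee", "report_date", "game_id", "game_time", "reason", "action"]


def parse_records(html):
    """Walk the rows in order, carrying the current club and team into each record."""
    soup = BeautifulSoup(html, "html.parser")
    club = team = None
    records = []
    for tr in soup.find_all("tr"):
        # split()/join collapses runs of whitespace, including the &nbsp; padding.
        cells = [" ".join(td.get_text().split()) for td in tr.find_all("td")]
        if "tblHeading" in tr.get("class", []):
            club, team = cells[0], None      # a club heading starts a new group
        elif len(cells) == 1:
            team = cells[0]                  # single-cell row: team heading
        elif len(cells) == len(COLUMNS):
            record = dict(zip(COLUMNS, cells))
            record.update(club=club, team=team)
            records.append(record)
    return records


def format_row(rec):
    """Build one quoted, comma-separated line in the layout the question asks for."""
    game_date = datetime.strptime(rec["game_time"], "%m/%d/%y %I:%M %p").strftime("%Y-%m-%d")
    fields = [rec["club"], rec["team"], rec["player"], "Player " + rec["game_id"],
              game_date, rec["reason"], rec["action"][:1]]
    return ",".join('"{}"'.format(f) for f in fields)


for rec in parse_records(SAMPLE_HTML):
    print(format_row(rec))
```

Because the club and team are carried over from the most recent heading rows, this should also cope with a team that has several detail rows in a row, which fixed three-row chunks would not handle.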