Kai Kai - 4 years ago 123
HTML Question

Using BeautifulSoup to scrape a specific website

Hi StackExchange Community!

I'm trying to write code that scrapes the following page: http://apps.mmc.gov.my/searchmmc/main_search.php?action=detail&id=10000

into a dataset consisting of the name, qualification, undergraduate degree, provisional registration number, and the Places of Practice listed underneath.

I've been struggling with this for the last couple of days because of the way the page is structured:



-->
</style>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>
:: MEDICAL REGISTER (MMC) ::
</title>
</head>
<body>
<table border="0" cellpadding="3" cellspacing="0" width="620">
<tr class="f10px-table-caption" valign="bottom">
<th colspan="3">
<table border="0" cellpadding="3" cellspacing="0">
<tr class="f10px-table-caption" valign="bottom">
<th align="left" class="f10px-table-caption" scope="col" width="25%">
NAME
</th>
<th class="f10px-table-caption" scope="col" width="5%">
:
</th>
<th align="left" class="f10px-table-caption" scope="col" width="70%">
JAPAR B ZAIRUN
</th>
<th rowspan="7">
<img height="140" src="showusrimg.php?PP_id=10000" width="100"/>
</th>
</tr>
<tr class="f10px-table-caption" valign="bottom">
<th align="left" class="f10px-table-caption" scope="col" width="25%">
QUALIFICATION
</th>
<th class="f10px-table-caption" scope="col" width="5%">
:
</th>
<th align="left" class="f10px-table-caption" scope="col" width="70%">
MD
</th>
</tr>
<tr class="f10px-table-caption" valign="bottom">
<th align="left" class="f10px-table-caption" scope="col" valign="top" width="25%">
UNDERGRADUATE OF
</th>
<th class="f10px-table-caption" scope="col" valign="top" width="5%">
:
</th>
<th align="left" class="f10px-table-caption" scope="col" valign="top" width="70%">
UNIVERSITI KEBANGSAAN MALAYSIA (UKM)
</th>
</tr>
<tr class="f10px-table-caption" valign="bottom">
<th align="left" class="f10px-table-caption" scope="col">
PROVISIONAL
<br/>
REGISTRATION NUMBER
</th>
<th class="f10px-table-caption" scope="col">
:
</th>
<th align="left" class="f10px-table-caption" scope="col">
20159
</th>
</tr>
<tr class="f10px-table-caption" valign="bottom">
<th align="left" class="f10px-table-caption" scope="col" width="25%">
DATE OF
<br/>
PROVISIONAL REGISTRATION
</th>
<th class="f10px-table-caption" scope="col" width="25%">
:
</th>
<th align="left" class="f10px-table-caption" scope="col" width="5%">
--
</th>
</tr>
<tr class="f10px-table-caption" valign="bottom">
<th align="left" class="f10px-table-caption" scope="col" width="25%">
FULL
<br/>
REGISTRATION NUMBER
</th>
<th class="f10px-table-caption" scope="col" width="5%">
:
</th>
<th align="left" class="f10px-table-caption" colspan="2" scope="col" width="70%">
31398
</th>
</tr>
<tr class="f10px-table-caption" valign="bottom">
<th align="left" class="f10px-table-caption" scope="col" width="25%">
DATE OF
<br/>
FULL REGISTRATION
</th>
<th class="f10px-table-caption" scope="col" width="5%">
:
</th>
<th align="left" class="f10px-table-caption" colspan="2" scope="col" width="70%">
16-06-1995
</th>
</tr>
</table>
</th>
</tr>
<tr>
<th bgcolor="#FFFFFF" class="f10px" scope="col">
</th>
<th bgcolor="#FFFFFF" class="f10px" scope="col">
</th>
<th bgcolor="#FFFFFF" class="f10px" scope="col">
</th>
</tr>
<tr class="f10px-table-caption" valign="bottom">
<th align="left" class="f10px-table-caption" colspan="3" scope="col">
*
</th>
</tr>
<tr class="f10px-table-caption" valign="bottom">
<th bgcolor="#FFFFFF" class="f10px-table-caption" colspan="3" scope="col">
<br/>
</th>
</tr>
<tr>
<th bgcolor="#FFFFFF" class="f10px-table-caption" colspan="3" scope="col">
<table align="center" border="1" bordercolor="#000000" bordercolordark="#000000" bordercolorlight="#000000" cellpadding="3" cellspacing="0" width="100%">
<tr bgcolor="#BDE3F9" class="f10px-table-header">
<th align="center" colspan="5">
APC
</th>
</tr>
<tr bgcolor="#BDE3F9" class="f10px-table-header">
<th align="center" width="4%">
#
</th>
<th align="center" width="10%">
APC YEAR
</th>
<th align="center" width="10%">
APC NO
</th>
<th align="left" width="38%">
PLACE OF PRACTICE (PRINCIPAL)
</th>
<th align="left" width="38%">
PLACE OF PRACTICE (OTHERS)
</th>
</tr>
<tr class="f10px-table-caption">
<td align="center">
1.
</td>
<td align="center">
2017
</td>
<td align="center">
15463
</td>
<td>
KLINIK ELOPURA SDN BHD
<br/>
NO. 31, GF, 2ND AVENUE
<br/>
90000 SANDAKAN
<br/>
SABAH
</td>
<td>
</td>
</tr>
<tr class="f10px-table-caption">
<td align="center">
2.
</td>
<td align="center">
2016
</td>
<td align="center">
13154
</td>
<td>
KLINIK ELOPURA SDN BHD
<br/>
NO. 31, GF, 2ND AVENUE
<br/>
90000 SANDAKAN
<br/>
SABAH
</td>
<td>
</td>
</tr>
<tr class="f10px-table-caption">
<td align="center">
3.
</td>
<td align="center">
2015
</td>
<td align="center">
10501
</td>
<td>
KLINIK ELOPURA SDN BHD
<br/>
NO. 31, GF, 2ND AVENUE
<br/>
90000 SANDAKAN
<br/>
SABAH
</td>
<td>
</td>
</tr>
</table>
</th>
</tr>
<th bgcolor="#FFFFFF" class="f10px-table-caption" colspan="3" scope="col">
Only the latest 3 years of the APC will be displayed as decided by the Council Members during the MMC meeting held on 12th July 2011.
</th>
</table>
<p>
<strong>
</strong>
</p>
<script language="javascript">
function GotoPage(pageno){

//obj=document.all.item("clue");
//obj2=document.all.item("cboSearch");
//obj2.value=obj.value;

obj=document.forms["searchuser"];
obj.action="main_search.php?action=search&page=" + pageno;
obj.submit();
}

function ShowDetails(id_pp){
win=window.open("main_search.php?action=detail&id=" + id_pp);
win.focus();
/*obj.action="main_search.php?action=detail&id=" + id_pp;
obj.submit();*/
}
</script>
</body>
</html>





If I can figure out how to extract the name and qualification in a structured manner, that would be a tremendous achievement in itself.

Thank you so much for taking the time to read my post.

Answer
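import requests
import bs4

# Fetch the detail page and parse it (the 'lxml' parser requires the lxml package).
r = requests.get('http://apps.mmc.gov.my/searchmmc/main_search.php?action=detail&id=10000')
soup = bs4.BeautifulSoup(r.text, 'lxml')

# The detail rows sit in a table nested inside a row of the outer table,
# and each of them carries valign="bottom".
for tr in soup.select('tr table tr[valign="bottom"]'):
    print(tr.get_text(strip=True))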

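out (labels that the page splits with <br/> are joined without a space, because get_text is called without a separator):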

NAME:JAPAR B ZAIRUN
QUALIFICATION:MD
UNDERGRADUATE OF:UNIVERSITI KEBANGSAAN MALAYSIA (UKM)
PROVISIONALREGISTRATION NUMBER:20159
DATE OFPROVISIONAL REGISTRATION:--
FULLREGISTRATION NUMBER:31398
DATE OFFULL REGISTRATION:16-06-1995
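
The question asks for a structured dataset, so here is a minimal sketch of one way to turn the same rows into a dictionary and to read the Places of Practice rows from the APC table. It is an illustration under stated assumptions, not a tested scraper for the live site: `SAMPLE_HTML` is a trimmed, hypothetical copy of the markup from the question so the code runs offline, `parse_details` and `parse_practice_rows` are names made up for this example, and the built-in `html.parser` is used instead of `lxml` only to avoid the extra dependency.

```python
import bs4

# Trimmed stand-in for the page in the question (assumption: the real page
# keeps the same nesting: detail rows and APC rows in tables inside <th> cells).
SAMPLE_HTML = """
<table>
<tr><th colspan="3">
<table>
<tr valign="bottom"><th>NAME</th><th>:</th><th>JAPAR B ZAIRUN</th></tr>
<tr valign="bottom"><th>PROVISIONAL<br/>REGISTRATION NUMBER</th><th>:</th><th>20159</th></tr>
</table>
</th></tr>
<tr><th colspan="3">
<table>
<tr><th colspan="5">APC</th></tr>
<tr><th>#</th><th>APC YEAR</th><th>APC NO</th><th>PLACE OF PRACTICE (PRINCIPAL)</th><th>PLACE OF PRACTICE (OTHERS)</th></tr>
<tr><td>1.</td><td>2017</td><td>15463</td><td>KLINIK ELOPURA SDN BHD<br/>90000 SANDAKAN</td><td></td></tr>
</table>
</th></tr>
</table>
"""


def parse_details(html):
    """Map each label (NAME, QUALIFICATION, ...) to its value."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    details = {}
    for tr in soup.select('tr table tr[valign="bottom"]'):
        # Passing " " as the separator keeps labels split by <br/> readable.
        cells = [th.get_text(" ", strip=True)
                 for th in tr.find_all("th", recursive=False)]
        if len(cells) >= 3:  # label, ":", value
            details[cells[0]] = cells[2]
    return details


def parse_practice_rows(html):
    """Return one dict per APC data row (the rows built from <td> cells)."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(", ", strip=True)
                 for td in tr.find_all("td", recursive=False)]
        if len(cells) == 5:  # "#", year, APC no, principal, others
            rows.append({"year": cells[1], "apc_no": cells[2],
                         "principal": cells[3], "others": cells[4]})
    return rows


details = parse_details(SAMPLE_HTML)
practice = parse_practice_rows(SAMPLE_HTML)
print(details)
print(practice)
```

To try this against the real page, pass `r.text` from the answer's `requests.get` call instead of `SAMPLE_HTML`. Whether the live markup still matches the dump in the question is an assumption worth checking first.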