YP41 - 11 months ago
CSS Question

Beautifulsoup - scraping everything but table data

Hi, I'm new to Python and am trying to download data from a table on a website (http://www.pa.org.mt/AppList?ReceivedDate=2016-8-31).

I've tried many different solutions, but everything I try returns an empty list. I read that the problem might be that the table is loaded using JavaScript; however, when I switch off JavaScript the table remains, and I can see the data I want when I view the page source.

I am using Python 2.7.

When I run this code:

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.pa.org.mt/AppList?ReceivedDate=2016-8-31'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
print soup

Where the table should be I get:

<link href="appsearch/main.css" rel="stylesheet" type="text/css" />
<TABLE id="Table1" cellSpacing="1" cellPadding="1" width="100%" border="0">
<TD align=center>

<br />
<br />

When I view the page source, however, I can see the information I want (I've copied and pasted a small part of it):

<TD align=center>
<p align='center' class='H1'><u>Planning Authority Applications Received (Planning Applications Outside Development Zone)</u></p><p align='center'>Result For Date 2016-8-31</p><p align='center'>Result output on 03/09/2016 23:23:29</p><strong><i>Disclaimer</strong>: The information ....in accordance with the Development Planning Act.</i>
<br />
<br />
<table class='formTable'><tr><td class='sectionHeading' colspan=2>Application Details</td></tr></table><table class='formTable'><TR><td class='sectionHeading'>Case Number</td><td class='sectionHeading'>Location</td><td class='sectionHeading'>Proposal</td><td class='sectionHeading'>Applicant</td><td class='sectionHeading'>Architect</td><td class='sectionHeading'>Case Category</td><td class='sectionHeading'>Local Council</td></tr><TR><td class='fieldData'><a href='SearchPA?Systemkey=166837&CaseFullRef=PA/05054/1

It would be great if you could give me any suggestions or point me to any material that might help me out.

As I said earlier, I'm new to both Python and Stack Overflow, so my apologies if a similar question has already been answered or if I haven't given the right information.


Answer Source

If you clear your cache and go directly to http://www.pa.org.mt/appsreceived?month=01/08/2016, you see no data at all, just as in your own output.

You need to use a session and visit the page preceding the page you want first:

import  requests
head = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"}

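with requests.Session() as s:
    s.headers.update(head)
    # Visit the preceding page first so the session carries over whatever state the site sets
    r = s.get("http://www.pa.org.mt/appsreceived?CaseType=PA&Category=PAI")
    # Then request the listing page within the same session
    r2 = s.get("http://www.pa.org.mt/AppList?ReceivedDate=2016-8-31")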

The next problem is that the HTML is broken, so html.parser will not give you what you want:

In [4]: with requests.Session() as s:
   ...:         s.headers.update(head)
   ...:         r= s.get("http://www.pa.org.mt/appsreceived?CaseType=PA&Category=PAI")
   ...:         page = (s.get("http://www.pa.org.mt/AppList?ReceivedDate=2016-8-31").content)
   ...:         soup = BeautifulSoup(page, 'html.parser')
   ...:         print(soup.select_one("#Table1"))
<table border="0" cellpadding="1" cellspacing="1" id="Table1" width="100%">
<td align="center">
<p align="center" class="H1"><u>Planning Authority Applications Received (Planning Applications Within Development Zone)</u></p><p align="center">Result For Date 2016-8-31</p><p align="center">Result output on 04/09/2016 01:56:44</p><strong><i>Disclaimer</i></strong>: The information below has been extracted from an on-line database and is meant only for your general guidance.The Planning Authority disclaims any responsibility for any inaccuracies there may be on this site. If you wish to verify the correctness of any information then you are advised to contact us directly. Furtheremore, in the event of any discrepancies between the information contained on this site and official printed communication then the latter is to prevail, in accordance with the Development Planning Act.</td></tr></table>

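lxml or html5lib will. Pass 'lxml' or 'html5lib' to BeautifulSoup in place of 'html.parser' (each has to be installed separately). I won't add the output as it is quite large, but either parser will give you the full table data.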
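
Once the full table comes through, something like the following rough sketch could pull the rows out as lists of strings. It assumes the data cells keep the class 'fieldData' seen in the page source in the question. The function name extract_rows and the sample markup are made up for illustration, and I have not run this against the live site.

```python
from bs4 import BeautifulSoup


def extract_rows(html, parser="lxml"):
    """Return each data row of the listing as a list of cell strings."""
    soup = BeautifulSoup(html, parser)
    rows = []
    for tr in soup.find_all("tr"):
        # recursive=False looks only at the row's own cells, so an outer
        # row that merely wraps a nested table is skipped.
        cells = tr.find_all("td", class_="fieldData", recursive=False)
        if cells:
            rows.append([td.get_text(strip=True) for td in cells])
    return rows


if __name__ == "__main__":
    # Made-up sample markup for illustration only; with the real site,
    # pass the `page` bytes fetched inside the session instead.
    sample = (
        "<table class='formTable'>"
        "<tr><td class='sectionHeading'>Case Number</td>"
        "<td class='sectionHeading'>Location</td></tr>"
        "<tr><td class='fieldData'>EXAMPLE/0001</td>"
        "<td class='fieldData'>Example Street</td></tr>"
        "</table>"
    )
    # html.parser is fine for this well-formed sample.
    print(extract_rows(sample, parser="html.parser"))
```

With the real page you would call extract_rows(page) inside the session block, leaving the parser at 'lxml' (or passing 'html5lib'). Header rows are skipped because their cells use 'sectionHeading' rather than 'fieldData'.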