CLWONG CLWONG - 2 months ago 19
Python Question

Get web page content (Not from source code)

I want to get the rainfall data of each day from here.

When I am in inspect mode, I can see the data. However, when I view the source code, I cannot find it.

I am using urllib2 and BeautifulSoup from bs4.


Here is my code:

import urllib2
from bs4 import BeautifulSoup

link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"

r = urllib2.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
print soup.find_all("td", class_="td1_normal_class")
# I also tried this one
# print soup.find_all("div", class_="dataTable")


And I got an empty list.

My question is: How can I get the page content, but not from the page source code?

Answer

If you cannot find the element in the page source, it is being generated after the page loads, typically by JavaScript (a framework such as Angular, or plain jQuery). urllib2 only downloads the raw HTML and never runs that JavaScript, so BeautifulSoup never sees the table. To work with the rendered HTML you need a browser that executes the page's scripts.
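To make the difference concrete, here is a minimal sketch using only the Python 3 standard library. The two HTML strings are made-up stand-ins (not the real page), and CellCollector and find_cells are hypothetical helper names; only the class names td1_normal_class and dataTable come from the question. It shows the same search returning nothing on the raw download and finding cells once the table exists in the DOM.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose class list contains css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # A valueless class attribute is reported as None, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.css_class in classes:
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def find_cells(html, css_class="td1_normal_class"):
    """Return the stripped text of all matching table cells in html."""
    parser = CellCollector(css_class)
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Made-up stand-in for what a plain HTTP download returns:
# an empty container plus the script that would fill it in later.
RAW_SOURCE = """
<div class="dataTable"></div>
<script>/* builds the table in the browser */</script>
"""

# Made-up stand-in for the DOM after a browser has run that script.
RENDERED_DOM = """
<div class="dataTable"><table><tr>
<td class="td1_normal_class">01</td>
<td class="td1_normal_class">0.5</td>
</tr></table></div>
"""

print(find_cells(RAW_SOURCE))    # []
print(find_cells(RENDERED_DOM))  # ['01', '0.5']
```

The parser is never at fault here: the cells are simply absent from the text it is given until some JavaScript engine has run.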

Try using Selenium. See also the related question:

How can I parse a website using Selenium and Beautifulsoup in python?

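from bs4 import BeautifulSoup
from selenium import webdriver

# Launch Firefox so that the page's JavaScript actually runs
driver = webdriver.Firefox()
driver.get('http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1')

# page_source is the HTML of the page as currently loaded in the browser
html = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
soup = BeautifulSoup(html, "html.parser")

print soup.find_all("td", class_="td1_normal_class")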

Note, however, that using Selenium slows the process down considerably, since it has to start a full browser (headless or not) and run the page's scripts.