plzhelpmi - 4 months ago
Python Question

BeautifulSoup returning [] when I run it

I am using Beautiful Soup with Python to retrieve weather data from a website.

Here's what the website looks like:

<channel>
<title>2 Hour Forecast</title>
<source>Meteorological Services Singapore</source>
<description>2 Hour Forecast</description>
<item>
<title>Nowcast Table</title>
<category>Singapore Weather Conditions</category>
<forecastIssue date="18-07-2016" time="03:30 PM"/>
<validTime>3.30 pm to 5.30 pm</validTime>
<weatherForecast>
<area forecast="TL" lat="1.37500000" lon="103.83900000" name="Ang Mo Kio"/>
<area forecast="SH" lat="1.32100000" lon="103.92400000" name="Bedok"/>
<area forecast="TL" lat="1.35077200" lon="103.83900000" name="Bishan"/>
<area forecast="CL" lat="1.30400000" lon="103.70100000" name="Boon Lay"/>
<area forecast="CL" lat="1.35300000" lon="103.75400000" name="Bukit Batok"/>
<area forecast="CL" lat="1.27700000" lon="103.81900000" name="Bukit Merah"/>
</channel>


I would like to retrieve "3.30 pm to 5.30 pm", which is the text inside the validTime element.

After inspecting the page's elements, I found that "3.30 pm to 5.30 pm" sits inside a <span> element with class "text":

[Screenshot: Inspect element of the website]

Based on the website, here is my Python code:

import requests
from bs4 import BeautifulSoup

url = "http://www.nea.gov.sg/api/WebAPI/?dataset=2hr_nowcast&keyref=<keyrefnumber>"

r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

g_data = soup.find_all("span", {"class": "text"})

print g_data

# write the "3.30 pm to 5.30 pm" value out to an XML file
outfile = open(r'C:\scripts\idk.xml', 'w')


When I run my Python code in CMD, all I get is [].

Answer

The main API page on the Singapore NEA site shows clearly that the response you get is an XML document:

2-hour Nowcast
Data Description: Weather forecast for next 2 hours
Last API Update: 1-Mar-2016
Frequency: Hourly
File Type: XML

You are looking at an HTML representation of the data in Chrome; Chrome transformed the XML to make it presentable, but your Python code is still accessing the XML directly. The <span> elements you found with the inspector exist only in Chrome's rendering, not in the response itself. The PDF documentation and your own question show the actual XML contents; parse those.
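As a rough illustration of why the original search comes back empty, the sketch below runs the same find_all call against a small hand-written sample. SAMPLE_XML is an assumption modelled on the excerpt in the question, not the live API response.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, shaped like the excerpt in the question.
SAMPLE_XML = """<channel>
<item>
<title>Nowcast Table</title>
<validTime>3.30 pm to 5.30 pm</validTime>
</item>
</channel>"""

soup = BeautifulSoup(SAMPLE_XML, "html.parser")

# The XML contains no <span> elements at all, so nothing can match.
spans = soup.find_all("span", {"class": "text"})
print(spans)
```

The search returns an empty result because the document being parsed is the raw XML, which has no <span> tags to find.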

If you want to use BeautifulSoup with XML, make sure you have the lxml project installed and use the 'xml' parser type. Then simply access the text content of the validTime element:

soup = BeautifulSoup(r.content, "xml")
valid_time = soup.find('validTime').string

Demo:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.nea.gov.sg/api/WebAPI/?dataset=2hr_nowcast&keyref=<private_api_key>')
>>> soup = BeautifulSoup(r.content, "xml")
>>> soup.find('validTime').string
u'4.00 pm to 6.00 pm'
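To try the same lookup without an API key, here is a minimal sketch that parses a hand-written sample instead of the live response. SAMPLE_XML and the forecasts dictionary are assumptions for illustration; the sample follows the excerpt in the question, and lxml must be installed for the "xml" parser to be available.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, shaped like the excerpt in the question.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<channel>
  <item>
    <validTime>3.30 pm to 5.30 pm</validTime>
    <weatherForecast>
      <area forecast="TL" lat="1.37500000" lon="103.83900000" name="Ang Mo Kio"/>
      <area forecast="SH" lat="1.32100000" lon="103.92400000" name="Bedok"/>
    </weatherForecast>
  </item>
</channel>"""

# The "xml" parser keeps tag names case-sensitive, so 'validTime' matches.
soup = BeautifulSoup(SAMPLE_XML, "xml")
valid_time = soup.find("validTime").string

# Attributes are read with dictionary-style access on each <area> tag.
forecasts = {area["name"]: area["forecast"] for area in soup.find_all("area")}

print(valid_time)
print(forecasts)
```

The same attribute access works for lat and lon if the coordinates are needed as well.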

If you are trying to write to an XML file, however, you have to make sure that what you write is valid XML yourself; this is outside the scope of BeautifulSoup.

Alternatively, use the ElementTree API that comes with Python by default; it can both parse the XML and produce new XML.
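A minimal sketch of the ElementTree route, assuming the response has the structure shown in the question. SAMPLE_XML, the <forecast> wrapper element and the output file name are all hypothetical choices made for this example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the excerpt in the question.
SAMPLE_XML = """<channel>
  <item>
    <validTime>3.30 pm to 5.30 pm</validTime>
  </item>
</channel>"""

# Parse the XML and read the text of <validTime> inside <item>.
root = ET.fromstring(SAMPLE_XML)
valid_time = root.findtext("item/validTime")

# Build a new, small XML document holding just that value.
out_root = ET.Element("forecast")
ET.SubElement(out_root, "validTime").text = valid_time

# Write it out; ElementTree takes care of producing well-formed XML.
out_path = os.path.join(tempfile.gettempdir(), "valid_time.xml")
ET.ElementTree(out_root).write(out_path, encoding="utf-8", xml_declaration=True)

print(valid_time)
```

With live data, the string passed to ET.fromstring would be r.content from the requests call instead of the sample.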
