user3550783 - 3 months ago
Python Question

Beautiful Soup Cleaning and Errors

I have this code:

from bs4 import BeautifulSoup
import urllib2
from lxml import html
from lxml.etree import tostring
trees = urllib2.urlopen('http://aviationweather.gov/adds/metars/index?station_ids=KJFK&std_trans=translated&chk_metars=on&hoursStr=most+recent+only&chk_tafs=on&submit=Submit').read()
soup = BeautifulSoup(open(trees))
print soup.get_text()
item=soup.findAll(id="info")
print item


However, when I type soup in my interpreter window it gives me an error, and when my program runs it prints a very long dump of HTML code, tags and so on. Any help would be greatly appreciated.

Answer

The first problem is in this part:

trees = urllib2.urlopen('http://aviationweather.gov/adds/metars/index?station_ids=KJFK&std_trans=translated&chk_metars=on&hoursStr=most+recent+only&chk_tafs=on&submit=Submit').read()
soup = BeautifulSoup(open(trees))

trees is already a string holding the page's HTML (the result of .read()), not a filename, so there is no need to call open() on it. Pass it to BeautifulSoup directly:

soup = BeautifulSoup(trees, "html.parser")

We are also explicitly setting html.parser as the underlying parser, so the result does not depend on which other parsers happen to be installed.
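As a small illustration of why open() is unnecessary, the sketch below feeds BeautifulSoup the same markup twice, once as a plain string and once as a file-like object. The HTML snippet is a made-up stand-in (not the real page), and the code uses the print() call form and u'' literals so that it should also run under Python 3.

```python
from io import StringIO

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; not the real site's markup.
html = u'<html><body><p id="info">Sample report</p></body></html>'

# A string of markup can be passed to BeautifulSoup directly.
soup_from_string = BeautifulSoup(html, "html.parser")

# A file-like object (here an in-memory one) is accepted as well.
soup_from_file = BeautifulSoup(StringIO(html), "html.parser")

text_from_string = soup_from_string.find(id="info").get_text()
text_from_file = soup_from_file.find(id="info").get_text()

print(text_from_string)
print(text_from_file)
```

Either way you end up with the same parsed tree. What does not work is passing the markup itself to open(), because open() expects a path on disk.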


Next, be specific about what you want to extract from the page. soup.get_text() returns the text of the whole document, which is why the output is so long. Here is example code that gets only the METAR text value:

from bs4 import BeautifulSoup
import urllib2


trees = urllib2.urlopen('http://aviationweather.gov/adds/metars/index?station_ids=KJFK&std_trans=translated&chk_metars=on&hoursStr=most+recent+only&chk_tafs=on&submit=Submit').read()
soup = BeautifulSoup(trees, "html.parser")

# Find the <strong> label, then read the text of the next <strong> element, which holds the value
item = soup.find("strong", text="METAR text:").find_next("strong").get_text(strip=True).replace("\n", "")
print item

At the time of writing, this prints:

KJFK 220151Z 20016KT 10SM BKN250 24/21 A3007 RMK AO2 SLP183 T02440206
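If you want to experiment with the label-then-next-element lookup without hitting the live site, here is a minimal offline sketch. The SAMPLE markup and the extract_metar helper are hypothetical, made up for illustration, and the real page layout may differ. It also differs from the one-liner above in three small ways: it uses string=, the newer spelling of the text= argument in Beautiful Soup 4.4 and later; it returns None instead of raising when the label is missing; and it collapses all runs of whitespace rather than only removing newlines.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates a "label cell, value cell" layout.
SAMPLE = u"""
<table>
  <tr>
    <td><strong>METAR text:</strong></td>
    <td><strong>KJFK 220151Z 20016KT
      10SM BKN250</strong></td>
  </tr>
</table>
"""


def extract_metar(markup):
    """Return the text following the 'METAR text:' label, or None if absent."""
    soup = BeautifulSoup(markup, "html.parser")

    # Find the <strong> element whose text is exactly the label.
    label = soup.find("strong", string="METAR text:")
    if label is None:
        return None

    # The value is assumed to live in the next <strong> in document order.
    value = label.find_next("strong")
    if value is None:
        return None

    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(value.get_text().split())


print(extract_metar(SAMPLE))
print(extract_metar(u"<p>no report here</p>"))
```

With the sample markup this prints KJFK 220151Z 20016KT 10SM BKN250 followed by None. The None check matters in practice, since chaining .find(...).find_next(...) raises AttributeError as soon as the label is not on the page.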