Valeria Lobos Ossandón - 2 months ago
Python Question

Crawling a news website and getting the news content

I'm trying to download the text from a news website. The HTML is:

<div class="pane-content">
<div class="field field-type-text field-field-noticia-bajada">
<div class="field-items">
<div class="field-item odd">
<p>"My Text" target="_blank">www.injuv.cl</a></strong></p> </div>


The output should be: My Text
I'm using the following Python code:

try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = "My URL"
parsed_html = BeautifulSoup(html)
p = parsed_html.find("div", attrs={'class':'pane-content'})
print(p)


But the output of the code is "None". Do you know what is wrong with my code?

Answer

The problem is that you are not parsing the HTML, you are parsing the URL string:

html = "My URL"
parsed_html = BeautifulSoup(html)
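
A quick way to see why this yields None: BeautifulSoup treats its first argument as markup, so a URL string is parsed as plain text with no tags in it. The sketch below assumes bs4 is installed; the URL is a hypothetical placeholder and nothing is downloaded. Recent bs4 versions may also warn that the input looks like a URL rather than markup.

```python
from bs4 import BeautifulSoup

# Hypothetical URL, used here only as a plain string; no request is made.
url = "http://example.com/news"

# The literal text of the URL is parsed as if it were the document.
# It contains no tags, so there is no <div> for find() to match.
soup = BeautifulSoup(url, "html.parser")
result = soup.find("div", attrs={"class": "pane-content"})

print(result)  # None
```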

Instead, you need to download the page source first. For example, in Python 2:

from urllib2 import urlopen

html = urlopen("My URL")
parsed_html = BeautifulSoup(html)

In Python 3, it would be:

from urllib.request import urlopen

html = urlopen("My URL")
parsed_html = BeautifulSoup(html)

Or, you can use the third-party "for humans"-style requests library:

import requests

html = requests.get("My URL").content
parsed_html = BeautifulSoup(html)

Also note that you should not be using BeautifulSoup version 3 at all - it is not maintained anymore. Replace:

try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

with just:

from bs4 import BeautifulSoup
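
Once the real HTML is being parsed, getting the text out of the div could look roughly like the sketch below. It is an illustration under assumptions, not the exact page: the html string is a simplified, well-formed stand-in for the downloaded source, and "html.parser" is passed explicitly so that bs4 does not have to guess a parser.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed stand-in for the downloaded page source (an
# assumption; in practice this would come from urlopen or requests).
html = """
<div class="pane-content">
  <div class="field field-type-text field-field-noticia-bajada">
    <div class="field-items">
      <div class="field-item odd">
        <p>My Text</p>
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pane = soup.find("div", attrs={"class": "pane-content"})

# find() returns None when nothing matches, so guard before reading the text.
text = pane.get_text(strip=True) if pane is not None else None

print(text)  # My Text
```

Printing the tag itself, as in the question, would show the whole div including its markup; get_text() returns only the text inside it.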