Tia Tia - 1 year ago 188
Python Question

Scraping the news (Python 3.6, BeautifulSoup)

I want to scrape spiegel.de/schlagzeilen to get all the news items shown below the dates (today, yesterday, two days ago).

<div class="schlagzeilen-content schlagzeilen-overview">


contains what I want, I think, but there is one problem left:

print(data)


prints the data I need, but it also comes with a lot of markup I don't want (names of the integrated modules, pieces of HTML/CSS, and so on).

So I chose

for item in data:
    print(item.text)


This gives very clean output(!), but now the article URLs are missing, and I need them. Can anybody help me out? Here is my code:

from bs4 import BeautifulSoup
import requests

website = 'http://spiegel.de/schlagzeilen'
r = requests.get(website)
soup = BeautifulSoup((r.content), "lxml")

data = soup.find_all("div", {"class": "schlagzeilen-content schlagzeilen-overview"})

for item in data:
    print(item.text)

Answer Source
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

website = 'http://spiegel.de/schlagzeilen'
r = requests.get(website)
soup = BeautifulSoup(r.content, "lxml")

# The container that holds all the headlines
div = soup.find("div", {"class": "schlagzeilen-content schlagzeilen-overview"})

# Select only the <a> tags that carry a title attribute
for a in div.find_all('a', title=True):
    # Turn the href into an absolute URL
    link = urljoin(website, a.get('href'))
    # The next sibling <span> holds the section and time
    print(a.text, a.find_next_sibling('span').text)
    print(link)

Output:

Südafrika: Dutzende Patienten sterben nach Verlegung (Panorama, 13:09)
http://spiegel.de/panorama/gesellschaft/suedafrika-verlegung-in-privatkliniken-dutzende-patienten-gestorben-a-1132677.html
Trumps Stotterstart: Ein Präsident, so unbeliebt wie keiner zuvor (Politik, 12:59)
http://spiegel.de/politik/ausland/donald-trump-als-us-praesident-so-unbeliebt-wie-kein-vorgaenger-a-1132554.html
Kontrolle von Gefährdern: Kabinett beschließt elektronische Fußfessel (Politik, 12:33)

The tag you need is the a tag, together with its sibling span, which contains text such as (Netzwelt, 12:23). So just use find_all to collect the a tags and use each one as an anchor to reach its span.
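
The anchor-and-sibling lookup can be tried offline on a small hand-written snippet. This is only a sketch: SAMPLE_HTML and extract_headlines are made-up names, the markup is an assumption modeled on the structure described here rather than the real page, and it uses the built-in "html.parser" instead of "lxml" so that nothing besides bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not copied from the real site).
SAMPLE_HTML = """
<div class="schlagzeilen-content schlagzeilen-overview">
  <a href="/panorama/a-1.html" title="First headline">First headline</a>
  <span class="headline-date">(Panorama, 13:09)</span>
  <a href="/impressum">Impressum</a>
  <a href="/politik/a-2.html" title="Second headline">Second headline</a>
  <span class="headline-date">(Politik, 12:59)</span>
</div>
"""


def extract_headlines(html):
    """Return (headline, section_and_time, href) tuples from the container div."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"class": "schlagzeilen-content schlagzeilen-overview"})
    if div is None:
        # Container not found, e.g. because the page layout changed
        return []

    results = []
    # title=True keeps only <a> tags that have a title attribute,
    # which skips the plain "Impressum" link in the sample
    for a in div.find_all("a", title=True):
        span = a.find_next_sibling("span")
        info = span.get_text(strip=True) if span is not None else ""
        results.append((a.get_text(strip=True), info, a.get("href")))
    return results


if __name__ == "__main__":
    for headline, info, href in extract_headlines(SAMPLE_HTML):
        print(headline, info)
        print(href)
```

The None checks are an addition for robustness; the answer's code assumes the div and the span are always present.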

And if you want the absolute URL rather than just the href value, use urljoin.
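
To see what urljoin does with different kinds of href values, here is a minimal sketch using only the standard library. The helper name absolutize and the example paths are made up for illustration; only the base URL comes from the code above.

```python
from urllib.parse import urljoin

BASE = "http://spiegel.de/schlagzeilen"


def absolutize(base, hrefs):
    """Resolve each href against base, the way a browser would."""
    return [urljoin(base, href) for href in hrefs]


if __name__ == "__main__":
    examples = [
        "/panorama/a-1.html",        # root-relative: keeps scheme and host
        "artikel.html",              # relative: replaces the last path segment
        "http://example.com/x.html"  # already absolute: returned unchanged
    ]
    for href, full in zip(examples, absolutize(BASE, examples)):
        print(href, "->", full)
```

Because urljoin leaves absolute URLs untouched, it is safe to apply to every link without checking its form first.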
