Tyrion Tyrion - 3 months ago 12
HTML Question

Error parsing text from a specific class in BeautifulSoup

I want to extract the text from a complicated news site. Example HTML code:

<div class="article-section clearfix">

<p style="line-height:1em;">
<span class="spTextSmaller">Wenig Zeit? Am Textende gibt's eine Zusammenfassung. </span>
</p>

<p>
<hr noshade="1"/>
</p>
<p>Das Ende der Welt ...</p>

<p>Brzezinski ... </p>

<p>blubla</p>

<p>tututut</p>

<div class="asset-box spPhotoGallery spPhotoGalleryZitat article-quote-gallery">
<div class="asset-title">"Ich werde es niemals ausschließen"</div>
<div class="zitat-box">
<a href="" class="zitat-box-button"><img src="..." height="48" width="48" alt="Zitate starten" />
</a><a href="..." title="Zitate starten" class="zitat-box-content">... </a></div>
<p>Zitate starten: Klicken Sie auf den Pfeil</p>
</div>

<p>Blabla ..."</p>

<p>randomletters </p>
</div>


I want to extract all the text between the p tags directly inside the div with the class "article-section clearfix", so in this example:

['Das Ende der Welt ... ', 'Brzezinski ... ', 'blubla', 'tututut', 'Blabla ..."', 'randomletters ']


It should not include "Zitate starten: Klicken Sie auf den Pfeil" which is the p-tag in the big div class "asset-box spPhotoGallery .." .

Currently I am using

textlist = []
for tag in articlesoup.find_all("p"):
    if tag.parent["class"] == ['article-section', 'clearfix']:
        textlist.append(tag.get_text() + "\n")
    else:
        continue


This results in


KeyError: 'class'

For other pages of the website without the div class="asset-box spPhotoGallery spPhotoGalleryZitat article-quote-gallery" it works fine. Classes like ['article-image-description'] are no problem. I have not found an answer to my problem in the other beautifulsoup related questions here.

Getting all text from the ['article-section', 'clearfix'] class by

for tag in articlesoup.find_all("div", class_=['article-section', 'clearfix']):
    tag.get_text()


results in too much unwanted text, which is why I will probably have to stick with the first approach.
Should I wrap it in a try/except to avoid the error? Thanks in advance for any help.

Answer

What you want is to use recursive=False:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for p in soup.find("div",class_="article-section clearfix").find_all("p", recursive=False):
    text = p.find(text=True, recursive=False).strip()
    if text:
        print(text)

soup.find("div",class_="article-section clearfix").find_all("p", recursive=False) gets only the p tags that are direct children of the div. Then text = p.find(text=True, recursive=False).strip() takes the first text node that sits directly under each p tag, ignoring any text nested inside the p tag's child tags.

Running this on your sample gives the output you want:

In [8]: html = """<div class="article-section clearfix">
  <p style="line-height:1em;">
<span class="spTextSmaller">Wenig Zeit? Am Textende gibt's eine Zusammenfassung. </span>
</p>
<p>
<hr noshade="1"/>
</p>
  <p>Das Ende der Welt ...</p>
<p>Brzezinski ... </p>
  <p>blubla</p>
  <p>tututut</p>
  <div class="asset-box spPhotoGallery spPhotoGalleryZitat article-quote-gallery">
    <div class="asset-title">"Ich werde es niemals ausschließen"</div>
        <div class="zitat-box">
                <a href="" class="zitat-box-button"><img src="..." height="48" width="48" alt="Zitate starten" />
                </a><a href="..." title="Zitate starten" class="zitat-box-content">... </a></div>
            <p>Zitate starten: Klicken Sie auf den Pfeil</p>
        </div>
  <p>Blabla ..."</p>
  <p>randomletters </p>
</div>"""

Then:

In [9]: soup = BeautifulSoup(html, "html.parser")

In [10]: for p in soup.find("div", class_="article-section clearfix").find_all("p", recursive=False):
   ....:         text = p.find(text=True, recursive=False).strip()
   ....:         if text:
   ....:                 print(text)
   ....:         
Das Ende der Welt ...
Brzezinski ...
blubla
tututut
Blabla ..."
randomletters
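
If you want the list from your question rather than printed output, the same idea can be wrapped in a small function. This is only a sketch: direct_paragraph_texts is a name I made up, the HTML is a shortened stand-in for your page, and string= is the newer spelling of the text= argument used above (Beautiful Soup 4.4.0 and later). It also guards against p tags that hold only child tags, where find returns None and calling .strip() on it would fail.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page (illustrative only).
html = """<div class="article-section clearfix">
<p><span>Teaser inside a span</span></p>
<p>Das Ende der Welt ...</p>
<div class="asset-box"><p>Zitate starten</p></div>
<p>randomletters </p>
</div>"""


def direct_paragraph_texts(markup, class_name="article-section clearfix"):
    """Collect the direct text of each <p> that is an immediate child of the div."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("div", class_=class_name)
    if container is None:
        # No matching div on this page.
        return []
    texts = []
    for p in container.find_all("p", recursive=False):
        # First text node directly under <p>; None if <p> has only child tags.
        node = p.find(string=True, recursive=False)
        if node is None:
            continue
        text = node.strip()
        if text:
            texts.append(text)
    return texts


result = direct_paragraph_texts(html)
print(result)
```

The p inside the asset-box div is skipped because it is not a direct child of the outer div, and the p that only wraps a span is skipped by the None check.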

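Your code raises the error because find_all("p") searches recursively, so it also returns p tags nested deeper in the document, and at least one of them has a parent with no class attribute. Indexing a tag with ["class"] raises a KeyError when the attribute is missing, as the following demonstrates: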

In [13]: html = """<div><p>foo</p> </div>"""

In [14]: soup = BeautifulSoup(html, "html.parser")

In [15]: print(soup.find("p").parent["class"])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-c4921fc9e631> in <module>()
----> 1 print(soup.find("p").parent["class"])

/usr/lib/python3/dist-packages/bs4/element.py in __getitem__(self, key)
    956         """tag[key] returns the value of the 'key' attribute for the tag,
    957         and throws an exception if it's not there."""
--> 958         return self.attrs[key]
    959 
    960     def __iter__(self):

KeyError: 'class'

In [16]: html = """<div class="bar"><p>foo</p> </div>"""

In [17]: soup = BeautifulSoup(html, "html.parser")

In [18]: print(soup.find("p").parent["class"])
['bar']
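
As for the try/except question: if you would rather keep your original loop, one option is to read the attribute with the tag's .get() method, which returns None for a missing attribute instead of raising. A minimal sketch, assuming a cut-down page where one p sits inside a parent without a class (the blockquote is my invention for the example):

```python
from bs4 import BeautifulSoup

# Cut-down page; the blockquote has no class attribute (illustrative only).
html = """<div class="article-section clearfix">
<p>blubla</p>
<div class="asset-box"><p>Zitate starten</p></div>
<blockquote><p>no class on this parent</p></blockquote>
</div>"""

soup = BeautifulSoup(html, "html.parser")

textlist = []
for tag in soup.find_all("p"):
    # .get("class") returns None when the parent has no class attribute,
    # where tag.parent["class"] would raise KeyError.
    if tag.parent.get("class") == ["article-section", "clearfix"]:
        textlist.append(tag.get_text())

print(textlist)
```

Here only 'blubla' is collected. Note that this version still uses get_text(), so unlike the recursive=False approach it will also pick up text nested inside child tags of a matching p, such as the span in your first paragraph.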