parsecer - 4 months ago
Python Question

Beautifulsoup: findAll recursive doesn't work

I'm trying to get articles from wired.com.
Generally, their articles' content looks like this:

<article itemprop="articleBody">
<p>Some text</p>
<p>Next text</p>
<p>...</p>
<p>...</p>
</article>


or like this:

<article itemprop="articleBody">
<div class="listicle-captions marg-t...">
<p></p>

</div>

</article>


So if the page is of type 1, I want to extract the <p> and <h> tags, while if the page is of type 2, I want to do something else. That is, if the <p> and <h> tags are direct children of <article>, then it's type 1.
I tried the following code, which looks for <p> and <h> tags and prints out the tag names. The problem is that recursive="False" doesn't seem to help: when tested on a type 2 page, it still finds the tags, while it shouldn't (I expected to get a NoneType object).

import urllib.request
from bs4 import BeautifulSoup
import datetime
import html
import sys

articleUrl="https://www.wired.com/2016/07/greatest-feats-inventions-100-years-boeing/"

soupArticle=BeautifulSoup(urllib.request.urlopen(articleUrl), "html.parser")

articleBody=soupArticle.find("article", {"itemprop":"articleBody"})
articleContentTags=articleBody.findAll(["h1", "h2","h3", "p"], recursive="False")

for tag in articleContentTags:
    print(tag.name)
    print(tag.parent.encode("utf-8"))


Why doesn't it work?

PS: Also, is there a difference between using findAll and findChildren, in general and in this particular case? These two look the same to me.

Answer

The string literal "False" is not the same as the boolean False; you need to actually pass recursive=False:

articleBody.find_all(["h1", "h2","h3", "p"], recursive=False)

Any non-empty string is considered truthy, so the only string you could pass that would work is the empty string, i.e. recursive="".

In [17]: bool("False")
Out[17]: True

In [18]: bool("foo")
Out[18]: True

In [19]: bool("")
Out[19]: False
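
To see why this matters for a keyword flag, here is a minimal sketch of a function that tests its flag by plain truthiness. The function name search and its return strings are hypothetical, and the truthiness check is an assumption about how such flags are typically evaluated, not a copy of BeautifulSoup's code:

```python
def search(recursive=True):
    """Hypothetical stand-in for any function that takes a boolean flag."""
    # A plain truthiness test: only falsy values take the first branch.
    if not recursive:
        return "children only"
    return "all descendants"


# Compare the real boolean with the strings discussed above.
for value in (False, "False", "", True):
    print(repr(value), "->", search(recursive=value))
```

The string "False" takes the same branch as True, which matches the behaviour described in the question.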

Stick to the actual boolean False, though. Also note that with recursive=False you will get an empty list/ResultSet back, not None, because you are calling find_all, not find.
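
Putting it together, the following is a rough sketch of how the two page types could be told apart. The two HTML strings are simplified stand-ins for the layouts in the question (not real wired.com markup), and direct_content_tags is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Simplified stand-ins for the two layouts described in the question.
TYPE_1 = """
<article itemprop="articleBody">
<h2>Heading</h2>
<p>Some text</p>
<p>Next text</p>
</article>
"""

TYPE_2 = """
<article itemprop="articleBody">
<div class="listicle-captions">
<p>Caption</p>
</div>
</article>
"""


def direct_content_tags(markup):
    """Return names of h1/h2/h3/p tags that are direct children of the article body."""
    soup = BeautifulSoup(markup, "html.parser")
    body = soup.find("article", {"itemprop": "articleBody"})
    # recursive=False (the boolean) restricts the search to direct children.
    tags = body.find_all(["h1", "h2", "h3", "p"], recursive=False)
    return [tag.name for tag in tags]


type_1_names = direct_content_tags(TYPE_1)
type_2_names = direct_content_tags(TYPE_2)

print(type_1_names)
print(type_2_names)

# An empty result (not None) signals a type 2 page.
print("type 2" if not type_2_names else "type 1")
```

Because find_all returns an empty ResultSet rather than None, the check for a type 2 page is a simple emptiness test on the result.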