parsecer parsecer - 1 year ago 70
Python Question

Beautifulsoup: findAll recursive doesn't work

I'm trying to get articles from
Generally, their articles' content looks like this:

<article itemprop="articleBody">
<p>Some text</p>
<p>Next text</p>

or like this:

<article itemprop="articleBody">
<div class="listicle-captions marg-t...">



So, if the page is of type 1, I want the <p> tags to be extracted, while if the page is of type 2, I want to do something else. In other words, if the <p> tags are direct children of <article>, then it's type 1.
I tried the following code. It looks for the h1, h2, h3 and p tags inside the article body and prints out the tag names. The thing is, the recursive argument doesn't seem to help: when tested on a type 2 page, it still finds the tags, while it shouldn't (I expected to get None).

import urllib.request
from bs4 import BeautifulSoup
import datetime
import html
import sys


soupArticle=BeautifulSoup(urllib.request.urlopen(articleUrl), "html.parser")

articleBody=soupArticle.find("article", {"itemprop":"articleBody"})
articleContentTags=articleBody.findAll(["h1", "h2","h3", "p"], recursive="False")

for tag in articleContentTags:
    print(tag.name)

Why doesn't it work?

PS Also, is there a difference between using
in general and in this particular case? These two look the same to me..

Answer Source

The string literal "False" is not the same as the boolean False; you need to actually pass recursive=False:

articleBody.find_all(["h1", "h2","h3", "p"], recursive=False)

Any non-empty string is considered truthy, so the only string you could pass that would work is an empty string, i.e. recursive="".

In [17]: bool("False")
Out[17]: True

In [18]: bool("foo")
Out[18]: True

In [19]: bool("")
Out[19]: False

But stick to using the actual boolean False. Also, with recursive=False you will get an empty list/ResultSet back when nothing matches, not None, because you are calling find_all, not find.
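
To see the difference end to end, here is a minimal sketch that runs the same lookup against two inline HTML strings standing in for the two page types. The sample markup, the shortened class name and the helper name direct_content_tags are assumptions made for illustration; they are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the two page types in the question.
TYPE_1 = '<article itemprop="articleBody"><p>Some text</p><p>Next text</p></article>'
TYPE_2 = (
    '<article itemprop="articleBody">'
    '<div class="listicle-captions"><p>Nested text</p></div>'
    '</article>'
)

CONTENT_TAGS = ["h1", "h2", "h3", "p"]


def article_body(html_text):
    """Parse the page and return the <article itemprop="articleBody"> tag."""
    soup = BeautifulSoup(html_text, "html.parser")
    return soup.find("article", {"itemprop": "articleBody"})


def direct_content_tags(html_text):
    """Return only the content tags that are direct children of the article."""
    return article_body(html_text).find_all(CONTENT_TAGS, recursive=False)


type_1_tags = direct_content_tags(TYPE_1)
type_2_tags = direct_content_tags(TYPE_2)

# The string "False" is truthy, so the search still descends into the <div>.
type_2_string_flag = article_body(TYPE_2).find_all(CONTENT_TAGS, recursive="False")

# find (singular) is the call that returns None when nothing matches.
type_2_first = article_body(TYPE_2).find(CONTENT_TAGS, recursive=False)

print([tag.name for tag in type_1_tags])
print(len(type_2_tags), len(type_2_string_flag), type_2_first)
```

With this in place, an empty result from direct_content_tags is one way to tell a type 2 page from a type 1 page, e.g. `if not direct_content_tags(page): ...`.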