parsecer - 3 months ago
Python Question

findAll doesn't work, nested tags

I'm parsing through this page. I need to get the text content, which is located in <p> tags. The general structure of the page is the following:

<html>
<body>
<article itemprop="articleBody">
<div...>
<div...>
<figure>
<span..></span>
<p>THE TEXT</p>
</figure>
</div>
</div>
</article>
</body>
</html>


So the <p> is not a direct child of <article>, but it is still inside it, and findAll should be able to find it. But it doesn't.

articleBody=soupArticle.find("article", {"itemprop":"articleBody"})
textList=articleBody.findAll("p")
print(len(textList)) #gives 0


What am I doing wrong here?

Answer

This should get you started:

from bs4 import BeautifulSoup
import mechanize

url = "https://www.wired.com/2016/08/live-debate-whats-right-kind-intersection"
br = mechanize.Browser()
response = br.open(url)
soup = BeautifulSoup(response, 'html.parser')


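# The article text appears to live inside inline <script> blocks rather than
# in the static HTML, so collect the text of every script tag
media = []
for x in soup.findAll("script", {"type": "text/javascript"}):
    media.append(x.get_text().split("*/"))

# The index 4 depends on the order of scripts on this page and may change
med = media[4][1].split("<p>")

# Keep each piece only up to the next tag
strin = []
for element in med:
    strin.append(element.split("<")[0])

for text in strin:
    print(text)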
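
As an alternative to trimming each piece by hand, the fragment taken from the script can be given back to BeautifulSoup, which also copes with tags nested inside a paragraph. This is only a sketch: script_text is a made-up stand-in for the real script contents, and extract_paragraphs is a hypothetical helper, not part of bs4. If the real script escapes its quotes or slashes, the fragment would need unescaping first.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(fragment):
    """Return the text of every <p> found in an HTML fragment."""
    inner = BeautifulSoup(fragment, "html.parser")
    return [p.get_text() for p in inner.find_all("p")]


# Made-up stand-in for the text of one inline <script> block
script_text = '/* comment */ var body = "<p>First <b>bold</b> paragraph</p><p>Second</p>";'

# Same idea as the answer: keep what follows the comment terminator
fragment = script_text.split("*/")[1]

for text in extract_paragraphs(fragment):
    print(text)
```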

You can encode each text as UTF-8, e.g. text.encode('utf-8'), if you want.
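
For what it's worth, nesting itself is not the problem. The small check below assumes the page had exactly the structure shown in the question; the html string is a made-up stand-in, not the real page. It shows findAll (find_all in current bs4) descending through the divs and the figure, so a count of 0 on the live page suggests the <p> tags were not in the HTML that reached the parser.

```python
from bs4 import BeautifulSoup

# Static HTML mirroring the structure described in the question
html = """
<html><body>
<article itemprop="articleBody">
  <div><div><figure>
    <span></span>
    <p>THE TEXT</p>
  </figure></div></div>
</article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article", {"itemprop": "articleBody"})

# find_all searches all descendants by default (recursive=True)
nested = article.find_all("p")

# recursive=False restricts the search to direct children only
direct = article.find_all("p", recursive=False)

print(len(nested), len(direct))
```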