DjH DjH - 2 months ago 18
Python Question

Beautifulsoup get text based on nextSibling tag name

I'm scraping multiple pages that all have a similar format, but it changes a little here and there and there are no classes to use to search for what I need.

The format looks like this:

<div id="mainContent">

<p>Some Text I don't want</p>
<p>Some Text I don't want</p>
<p>Some Text I don't want</p>
<span> More text I don't want</span>
<ul>...unordered-list items..</ul>

<p>Text I WANT</p>
<ol>...ordered-list items..</ol>

<p>Text I WANT</p>
<ol>...ordered-list items..</ol>

</div>


The number of ordered/unordered lists and other tags changes depending on the page, but what stays the same is that I always want the text from the <p> tag that is the previous sibling of the <ol> tag.

What I'm trying (and isn't working) is:

main = soup.find("div", {"id":"mainContent"})

for d in main.children:
    if d.name == 'p' and d.nextSibling.name == 'ol':
        print(d.text)
    else:
        print("fail")


The output of this is "fail" for every iteration. In trying to figure out why this isn't working, I tried:

for d in main.children:
    if d.name == 'p':
        print(d.nextSibling.name)
    else:
        print("fail")


The output of this is something like:

fail
None
fail
None
fail
None
fail
fail
fail
fail
fail
None
fail


etc...

Why is this not working like I think it would? How could I get the text from a <p> element only if the next tag is <ol>?

Answer

You only want the p tags that come immediately before an ol tag, so find the ol tags first and then step back to the previous Tag object, which in this case is the p tag. Your code is not working because there are newlines between the tags, and BeautifulSoup represents them as NavigableString objects. d.nextSibling returns those newline strings too, which is why its name prints as None. So you have to check the type of each sibling and skip anything that is not a Tag.
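To see those whitespace nodes for yourself, here is a minimal sketch. The HTML string is a cut-down stand-in for the real page (an assumption), and html.parser is chosen only so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical, shortened version of the page from the question.
html = """<div id="mainContent">
<p>Text I WANT</p>
<ol><li>item</li></ol>
</div>"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", {"id": "mainContent"})

# The node right after <p> is the newline between the tags, not <ol>.
sibling = soup.find("p").next_sibling
print(repr(sibling), isinstance(sibling, NavigableString))

# Label every direct child of the div as either a tag or a string node.
kinds = ["Tag" if isinstance(c, Tag) else "String" for c in main.children]
print(kinds)
```

Every other child is a string node, so a direct next-sibling check on a p lands on a newline rather than on the ol.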

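from bs4 import Tag

# soup is the BeautifulSoup object for the page
# find every <ol>, then walk backwards to the nearest Tag sibling
ols = soup.find_all('ol')
for ol in ols:
    prev = ol.previous_sibling
    # skip newlines and other NavigableString nodes; stop if siblings run out
    while prev is not None and not isinstance(prev, Tag):
        prev = prev.previous_sibling
    # print only when the preceding tag really is a <p>
    if prev is not None and prev.name == 'p':
        print(prev.text)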

This will give you the text you want.

Text I WANT
Text I WANT
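
If you would rather keep the original idea of starting from each p and looking at what follows, one option is to drop the string nodes first and then compare neighbouring tags. This is only a sketch: the HTML below is a shortened, hypothetical version of the page, and the names tags and wanted are made up for illustration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample modelled on the structure in the question.
html = """<div id="mainContent">
<p>Some Text I don't want</p>
<span>More text I don't want</span>
<ul><li>unordered item</li></ul>

<p>Text I WANT</p>
<ol><li>ordered item</li></ol>

<p>Second text I WANT</p>
<ol><li>ordered item</li></ol>
</div>"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", {"id": "mainContent"})

# Keep only real tags, dropping the whitespace strings between them.
tags = [child for child in main.children if isinstance(child, Tag)]

# Pair each tag with the tag after it and keep <p> followed by <ol>.
wanted = [
    current.get_text()
    for current, following in zip(tags, tags[1:])
    if current.name == "p" and following.name == "ol"
]
print(wanted)
```

Because the list contains only Tag objects, the last element never needs a None check, and a p followed by anything other than an ol is skipped.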