Arqam Arqam - 7 months ago 8
Python Question

How to extract text using BeautifulSoup when the same tag also appears in places that are not useful

I am doing a little bit of web scraping and I need the text between the <p> tags:

<div>
<small>147 out of 252 people found the following review useful:</small><br>
<a href="/user/ur0935867/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMTkzOTQxMDY2MV5BMl5BanBnXkFtZTgwNjA3NjgwMDE@._V1._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
<h2>Unbelievable and way overrated</h2>
<img width="102" height="12" alt="3/10" src="http://i.media-imdb.com/images/showtimes/30.gif"><br>
<b>Author:</b>
<a href="/user/ur0935867/">glenmoreland</a> <small>from Holland</small><br>
<small>18 January 2016</small><br>
<p><b>*** This review may contain spoilers ***</b></p>

</div>
<p>
I cannot believe how many people think this is a good movie....watching
a guy struggle to survive for 2 hours ...come on people..I know there
are not many good movies being made but my word....so many things are
unbelievable...the bear attack, carrying a near dead guy out of the
wilderness up a mountain...going over a cliff on a horse and not
getting hurt...spending long periods of time in freezing cold
water.....surviving extreme cold overnight inside a dead horse...my god
the list is endless....and for Leo&#x27;s so called acting don&#x27;t get me
started...a lot of crawling and moaning and groaning....the whole thing
was a letdown and really a waste of time...also tell the director to
back the camera up a bit on those facial close-ups...they were also
ridiculous...trust me save your money and go see The Hateful Eight.
</p>

<div class="yn" id="ynd_3398112">

<form method="get"

action="reviews"

>
Was the above review useful to you?


I just need the review text inside the <p> tag. The page source contains many other <p> tags that do not contain reviews. How can I get only the text of the reviews using BeautifulSoup?

PS: The source code is from http://www.imdb.com/title/tt1663202/reviews?ref_=tt_ov_rt

Answer

Have you tried something like this?

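from bs4 import BeautifulSoup

# html is assumed to already hold the page source as a string
soup = BeautifulSoup(html, "html.parser")

reviews = []
for tag in soup.find_all('p'):
    # Skip paragraphs that contain a <b> tag, such as the spoiler warning
    if tag.find('b') is None:
        reviews.append(tag.get_text().strip())

# Print every matching paragraph, not just the last one found
for review in reviews:
    print(review)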

Or, even better, you could try find_previous_sibling('p'), anchored on that <div class="yn" id="ynd_3398112"> tag. I noticed that the review is not inside that <div> but sits immediately before it, so you could use this to locate the text you're looking for. Sorry, but your question is not entirely clear.
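A minimal sketch of the find_previous_sibling('p') idea, assuming the bs4 package (Beautiful Soup 4) and that every review paragraph is directly followed by a <div class="yn"> voting block, as in the snippet from the question. The function name extract_reviews and the SAMPLE_HTML string are made up for illustration, and the sample is a trimmed copy of the question's markup rather than the live page, so the real page may need adjustments.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (illustrative only)
SAMPLE_HTML = """
<div>
<small>147 out of 252 people found the following review useful:</small><br>
<h2>Unbelievable and way overrated</h2>
<p><b>*** This review may contain spoilers ***</b></p>
</div>
<p>
I cannot believe how many people think this is a good movie.
</p>
<div class="yn" id="ynd_3398112">
<form method="get" action="reviews">
Was the above review useful to you?
</form>
</div>
"""


def extract_reviews(html):
    """Return the text of each <p> that directly precedes a div.yn block."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for vote_div in soup.find_all("div", class_="yn"):
        # The review <p> is a sibling of the voting <div>, not a child of it
        review_p = vote_div.find_previous_sibling("p")
        if review_p is not None:
            # Collapse the line breaks inside the paragraph into single spaces
            reviews.append(" ".join(review_p.get_text().split()))
    return reviews


for review in extract_reviews(SAMPLE_HTML):
    print(review)
```

Anchoring on the voting block avoids guessing which <p> tags are reviews: the spoiler warning sits inside the header <div>, so it is never a sibling of the voting block and is skipped without any extra filtering.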
