Kumakaja Kumakaja - 19 days ago 5
Python Question

Find a link by its text when the text can contain noise

I want to use BeautifulSoup4 to find a link that contains a given text along with some surrounding noise:

<a href="#">
<span>gggggggggggg</span>
Some text123
<div>fdsfdsfdsfd</div>
<span> fdsfdsfdsfd</span>
</a>


When I try to find it by "Some text123", it fails:

soup123.find("a", "Some text123") # => None


What is the solution for this?

Update:

There can be many "a" tags in the document, but only one of them contains "Some text123".

Answer

The following might suit your needs. It finds all a tags and checks whether the search text is present anywhere in each tag's combined text. It then prints the href attribute of every matching tag:

from bs4 import BeautifulSoup

html = """
    <a href="#1"><span>gggggggggggg</span>Some text123<div>fdsfdsfdsfd</div><span> fdsfdsfdsfd</span></a>
    <a href="#2"><span>gggggggggggg</span>Some text124<div>fdsfdsfdsfd</div><span> fdsfdsfdsfd</span></a>"""

soup = BeautifulSoup(html, "html.parser")
search = "Some text123"

for a in soup.find_all('a'):
    if search in a.text:
        print(a['href'])  # print() call works on Python 3

So for my example, it would display:

#1
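
Since the matching link is unique, a variation on the loop above is to let find() do the filtering and return just that one tag. The original call failed because the second positional argument of find() is matched against the tag's CSS class, not its text. What follows is only a sketch: the helper name find_link_by_text is made up for illustration, and it assumes a substring match on the tag's combined text is what you want.

```python
from bs4 import BeautifulSoup

html = """
<a href="#1"><span>gggggggggggg</span>Some text123<div>fdsfdsfdsfd</div><span> fdsfdsfdsfd</span></a>
<a href="#2"><span>gggggggggggg</span>Some text124<div>fdsfdsfdsfd</div><span> fdsfdsfdsfd</span></a>
"""


def find_link_by_text(soup, search):
    """Hypothetical helper: return the first <a> tag whose combined text
    (including text of nested tags) contains `search`, or None."""
    # find() accepts a function; it is called with each tag and keeps
    # the first one for which it returns True.
    return soup.find(lambda tag: tag.name == "a" and search in tag.get_text())


soup = BeautifulSoup(html, "html.parser")
link = find_link_by_text(soup, "Some text123")
print(link["href"] if link is not None else "no match")
```

Because this is a substring test, a shorter search string such as "Some text12" would match both links and find() would return whichever comes first, so the search text needs to be specific enough to be unique.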