jules - 1 month ago
Python Question

Solved! Finding href by anchor text inside table/list

I'm trying to use Python bs4 to extract an href with a specific anchor text from a website into which I successfully logged in (using requests) beforehand.

Here is the pseudo html of the landing page:

<table class="submissions">
<thead>some thead</thead>
<tbody><tr class="active">
<th scope="row">uninterestingtext</th>
<td>uninterestingtext</td><td></td>
</tr>
<tr class="active">
<th scope="row">uninteresting</th>
<td>uninteresting text</td><td></td></tr>
<tr class="lastrow active"><th scope="row">uninteresting</th>
<td>uninteresting text</td>
<td></td>
</tr>
<tr class="lastrow inactive">
<th scope="row">uninteresting text</th>
<td>uninterestingtext
<ul>
<li><a href="uninteresting_href">someLink</a> </li>
<li><a href="uninteresting_href">someLink</a> </li>
<li><a href=**InterestingLink**>**Upload...**</a></li>
</ul>
</td>
</tr></tbody></table>


Now I am trying to extract the InterestingLink by looking for the Upload... anchor text inside the 'a' tags.

Here is what I tried:

landing_page_soup = BeautifulSoup(responseFromSuccessfulLogin.text, 'html.parser')
important_page = landing_page_soup.find('a',{'href':True,'text':'Upload...'}).get('href')


But this always throws the error

AttributeError: 'NoneType' object has no attribute 'get'


because the find() call returns None, so calling .get('href') on it fails before "important_page" is ever assigned.
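For what it's worth, putting 'text' inside the attrs dict makes BeautifulSoup look for an HTML attribute literally named text, which no tag has, so find() returns None. The anchor text has to go in the separate string= keyword instead. A minimal sketch against a trimmed-down version of the pseudo HTML above (the href values are placeholders):

```python
from bs4 import BeautifulSoup

# trimmed-down version of the pseudo HTML, placeholder href values
html = """
<table class="submissions"><tr><td><ul>
<li><a href="uninteresting_href">someLink</a></li>
<li><a href="InterestingLink">Upload...</a></li>
</ul></td></tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')

# 'text' inside the attrs dict is matched as an HTML attribute named "text",
# which no <a> tag has -- so this returns None
print(soup.find('a', {'href': True, 'text': 'Upload...'}))  # None

# the anchor text belongs in the separate string= keyword instead
link = soup.find('a', href=True, string='Upload...')
print(link['href'])  # InterestingLink
```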

Note: I have made sure that "responseFromSuccessfulLogin.text" is the correct HTML and contains the desired links.

After reading other forum threads about similar problems, I also tried the select() method with CSS selectors as well as findAll(), with no success.

I feel like I'm messing something up because the links are nested inside a table.

I am looking forward to any help!
Greets

UPDATE:

important_page = landing_page_soup.find('a', title='Upload...')['href']


works perfectly for me. I get only the link I want.

Answer

BeautifulSoup accepts callable objects as the text filter.

html = BeautifulSoup(response.content, 'html.parser')
# the callable is applied to each tag's string; guard against None
important_page = html.find_all('a', href=True, text=lambda t: t and 'Upload...' in t)

print(important_page[0]['href'])
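Run against a snippet shaped like the question's pseudo HTML (placeholder href values), this picks out only the matching link. The `t and` guard matters because the callable also receives None for tags without a string, and the substring check means it still matches if the anchor text carries extra whitespace:

```python
from bs4 import BeautifulSoup

# placeholder href values, shaped like the question's pseudo HTML
sample = """
<ul>
<li><a href="uninteresting_href">someLink</a></li>
<li><a href="uninteresting_href">someLink</a></li>
<li><a href="InterestingLink">Upload...</a></li>
</ul>
"""

soup = BeautifulSoup(sample, 'html.parser')
# substring match on each tag's string; 't and' guards against None
matches = soup.find_all('a', href=True, text=lambda t: t and 'Upload...' in t)
print(len(matches))        # 1
print(matches[0]['href'])  # InterestingLink
```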