edsheeran - 3 months ago
Python Question

Can't find string after a tag with BeautifulSoup in Python?

I want to get the text inside this HTML, but no matter what I try, it doesn't work (the string comes back as None):

<a href="/analyze/default/index/49398962/1/34925733" target="_blank">
<img alt="" class="ajax-tooltip shadow radius lazy" data-id="acctInfo:34925733_1" data-original="/upload/profileIconId/default.jpg" src="/images/common/transbg.png"/>
Jue VioIe Grace
</a>


There are a few of these on the page, and I tried this:

print([a.string for a in soup.findAll('td', class_='tou')])


The output is just None.

EDIT: Here is the entire page HTML; I hope this helps. To clarify, I need to find all instances like the one above and extract their string.

http://pastebin.com/4mvcMsJu

Answer

You need to select the a from the parent td and call .text. The text is inside the anchor, which is a child of the td:

print([td.a.text for td in soup.find_all('td', class_='tou')])
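
As a minimal sketch of the same idea, the snippet below uses get_text(strip=True) to drop the surrounding whitespace and skips any td that has no anchor (td.a is None in that case, so td.a.text would raise an AttributeError). The HTML here is a made-up fragment for illustration, not the real page, and the second cell without a link is an assumption added to show the guard:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one cell like the one in the question,
# plus one cell without a link to show why the guard is useful.
html = """
<table><tr>
<td class="tou"><a href="/x"><img alt="" src="/a.png"/>
 Jue VioIe Grace
</a></td>
<td class="tou">no link here</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

names = [
    td.a.get_text(strip=True)      # text inside the anchor, whitespace stripped
    for td in soup.find_all("td", class_="tou")
    if td.a is not None            # skip cells that contain no <a> at all
]
print(names)
```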

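There is obviously a td with the class tou, or you would not be getting a list containing None. Calling .string on the td gives None because .string is None whenever a tag has more than one child, and here the td contains whitespace as well as the anchor: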

In [10]: html = """<td class='tou'>
          <a href="/analyze/default/index/49398962/1/34925733" target="_blank">
       <img alt="" class="ajax-tooltip shadow radius lazy" data-id="acctInfo:34925733_1" data-original="/upload/profileIconId/default.jpg" src="/images/common/transbg.png"/>
       Jue VioIe Grace
      </a>
      </td>"""

In [11]: soup = BeautifulSoup(html,"html.parser")

In [12]: [a.string for a in soup.find_all('td', class_='tou')]
Out[12]: [None]

In [13]: [td.a.text for td in soup.find_all('td', class_='tou')]
Out[13]: [u'\n\n       Jue VioIe Grace\n      ']

You could also call .text on the td:

In [14]: [td.text for td in soup.find_all('td', class_='tou')]
Out[14]: [u'\n\n\n       Jue VioIe Grace\n      \n']

But that may get more than you want, since it returns all of the text inside the td, not just the anchor's text.
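
To illustrate when the two would differ, here is a small sketch with a hypothetical cell that has extra text next to the anchor. The span with class rank and its content are invented for this example and do not come from the page in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical cell: the <span> is an assumption, added only to show
# how td.text can pick up text that lives outside the anchor.
html = """<td class="tou">
<a href="/profile"><img alt="" src="/icon.png"/> Jue VioIe Grace </a>
<span class="rank">Diamond</span>
</td>"""

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="tou")

anchor_text = td.a.text.strip()  # only the text inside the <a>
cell_text = td.text.strip()      # every piece of text inside the <td>

print(repr(anchor_text))
print(repr(cell_text))
```

Here anchor_text is just the name, while cell_text also contains the text from the span.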

Using your full HTML from pastebin:

In [18]: import requests

In [19]: soup = BeautifulSoup(requests.get("http://pastebin.com/raw/4mvcMsJu").content,"html.parser")

In [20]: [td.a.text.strip() for td in soup.find_all('td', class_='tou')]
Out[20]: 
[u'KElTHMCBRlEF',
 u'game 5 loser',
 u'Cris',
 u'interestingstare',
 u'ApoIlo Price',
 u'Zary',
 u'Adrian Ma',
 u'Liquid Inori',
 u'focus plz',
 u'Shiphtur',
 u'Cody Sun',
 u'ApoIIo Price',
 u'Pobelter',
 u'Jue VioIe Grace',
 u'Valkrin',
 u'Piggy Kitten',
 u'1 and 17',
 u'BLOCK IT',
 u'JiaQQ1035716423',
 u'Twitchtv Flaresz']

In this case td.text.strip() gives you the same output:

In [23]: [td.text.strip() for td in soup.find_all('td', class_='tou')]
Out[23]: 
[u'KElTHMCBRlEF',
 u'game 5 loser',
 u'Cris',
 u'interestingstare',
 u'ApoIlo Price',
 u'Zary',
 u'Adrian Ma',
 u'Liquid Inori',
 u'focus plz',
 u'Shiphtur',
 u'Cody Sun',
 u'ApoIIo Price',
 u'Pobelter',
 u'Jue VioIe Grace',
 u'Valkrin',
 u'Piggy Kitten',
 u'1 and 17',
 u'BLOCK IT',
 u'JiaQQ1035716423',
 u'Twitchtv Flaresz']

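But you should understand that there is a difference: td.a.text returns only the text inside the anchor, while td.text returns all of the text inside the td. There is also a difference between .string and .text: .string returns a value only when the tag holds exactly one string and is None when the tag has more than one child, while .text joins together every string found beneath the tag.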
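
A short sketch of how .string and .text behave, using two made-up paragraphs (the ids one and two are just labels for this example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p id="one">hello</p>'
    '<p id="two"><b>hello</b> world</p>',
    "html.parser",
)

one = soup.find("p", id="one")
two = soup.find("p", id="two")

print(one.string)    # the tag has a single text child, so .string returns it
print(two.string)    # two children (<b> and a text node), so .string is None
print(two.text)      # .text joins all strings beneath the tag
print(two.b.string)  # the <b> itself has a single text child
```

This is the same situation as the td in the question: it has more than one child, so .string is None, while .text still returns the combined text.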
