ParagM ParagM - 1 month ago 10
Python Question

Separating two texts within the same <td> tag using Python BeautifulSoup

I have very limited HTML knowledge and I am only just starting with Beautiful Soup, so my question may not be framed correctly.
My HTML source looks something like this:

<TD width="15%">Text1</TD>
<TD width="85%">Text2<A href="link1">(6)</A>
Text3<A href="link2">(4)</A>
</TD>


It appears on the webpage as Text1/Text2 and Text1/Text3 (perhaps due to some code that I do not understand and may not have copied here).

However, I am trying to write Python code with BeautifulSoup to parse this information into a Python object. I thought the first step would be to extract the texts separately and then merge them later. I am able to extract Text1 easily using code like this:

from bs4 import BeautifulSoup

url = "my url (static page stored locally)"
soup = BeautifulSoup(open(url), 'lxml')
t1_soup = soup.find_all('td', {'width': '15%'})
# The second cell has width="85%" in the HTML above
t2_soup = soup.find_all('td', {'width': '85%'})


text1_str = []
for item in t1_soup:
    text1_str.append(item.text)


text2_str = []
for item in t2_soup:
    text2_str.append(item.text)


The first for loop gives me Text1 cleanly, but the second for loop gives me a single string 'text2 text3'. I am not sure how to separate them so that I can eventually convert this to Text1/Text2 and Text1/Text3.

It is possible that the Python code I wrote is also not very efficient; if you have a suggestion for a better approach, I would appreciate it.

Answer

You can solve it by finding all a elements inside td and getting the previous text siblings:

for item in t2_soup:
    print([a.previous_sibling.strip() for a in item.find_all("a")])

For the sample HTML this prints [u'Text2', u'Text3'] on Python 2 (Python 3 shows the same strings without the u prefix).
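
Note that previous_sibling can be None or another tag if a link is not directly preceded by text, in which case .strip() would fail. A minimal, more defensive sketch follows. It assumes the sample HTML from the question, uses the built-in "html.parser" instead of lxml so that it runs without extra dependencies, and the helper name labels_before_links is made up for illustration:

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question.
html = """
<TD width="15%">Text1</TD>
<TD width="85%">Text2<A href="link1">(6)</A>
Text3<A href="link2">(4)</A>
</TD>
"""

soup = BeautifulSoup(html, "html.parser")


def labels_before_links(td):
    """Return the non-empty text that directly precedes each <a> in td."""
    labels = []
    for a in td.find_all("a"):
        prev = a.previous_sibling
        # Skip links preceded by nothing, by another tag, or by whitespace only.
        if isinstance(prev, NavigableString) and prev.strip():
            labels.append(prev.strip())
    return labels


td = soup.find("td", {"width": "85%"})
labels = labels_before_links(td)
print(labels)
```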

Or, you can find all the text nodes in every td non-recursively:

for item in t2_soup:
    print([text.strip() for text in item.find_all(text=True, recursive=False)])

This may produce extra empty strings - make sure to filter them.
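
To get from there to the Text1/Text2 and Text1/Text3 form asked about in the question, the direct text nodes can be stripped, filtered, and joined with the first cell's text. The following is only a sketch under a few assumptions: it uses the sample HTML from the question, the built-in "html.parser" rather than lxml, the string= keyword (the newer name for text= in Beautiful Soup 4.4 and later), and it assumes one 15% cell paired with one 85% cell:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<TD width="15%">Text1</TD>
<TD width="85%">Text2<A href="link1">(6)</A>
Text3<A href="link2">(4)</A>
</TD>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the first cell, with surrounding whitespace removed.
first = soup.find("td", {"width": "15%"}).get_text(strip=True)

# Direct text children of the second cell only (link text is excluded).
second_td = soup.find("td", {"width": "85%"})
parts = [s.strip() for s in second_td.find_all(string=True, recursive=False)]

# Drop the empty strings left over from whitespace-only nodes.
parts = [p for p in parts if p]

pairs = ["{}/{}".format(first, p) for p in parts]
print(pairs)
```

For a real page with many rows, the same steps would be applied per row, pairing each 15% cell with the 85% cell that follows it.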