J.jaques - 2 months ago
HTML Question

Web data scraping: split HTML content

I'm scraping a website and I was able to reduce a variable called "gender" to this:

[<span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>, <span style="text-decoration: none;">associé gérant </span>]


Now I'd like to keep only "associé" in the variable, but I can't find a way to split this HTML.

The reason is that I want to know whether it is "associé" (male) or "associée" (female).

Does anyone have any ideas?

Cheers

----- edit ----
Here is my code, which produces the HTML output:

import requests
from bs4 import BeautifulSoup

url = "http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false"

r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
table = soup.select_one("#adm").find_next("table")  # select_one returns only the first tag that matches the selector
table2 = soup.select_one("#adm").find_all_next("table")


# the attribute value is quoted because it contains a colon
output = table.select('td span[style^="text-decoration:"]', limit=2)  # .text.split(",", 1)[0].strip()

print(output)




Answer

Whatever the parent of the two elements is, you can use the selector span:nth-of-type(2) to get the second span, then just check the text:

from bs4 import BeautifulSoup

html = """<span style="text-decoration: none;">
                        Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
                    </span>
           <span style="text-decoration: none;">associé gérant </span>"""

soup = BeautifulSoup(html, "lxml")  # name the parser explicitly, as in the question

text = soup.select_one("span:nth-of-type(2)").text

Or, if it is not always the second span, you can search for the span by the partial text associé:

import re
text = soup.find("span", string=re.compile(r"associé")).text  # string= replaces the older text= argument; the ur"" prefix is Python 2 only
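
Note that the pattern associé also matches associée, so the search finds the span either way but does not by itself tell the two forms apart. A minimal sketch of one way to do that with a stricter pattern, assuming the span text has already been extracted into a plain string (ROLE_RE and role_form are names made up for this example, not part of BeautifulSoup):

```python
import re
import unicodedata

# Hypothetical pattern: "associé" with an optional feminine "e",
# bounded on both sides so that e.g. "associés" is not matched.
ROLE_RE = re.compile(r"\bassocié(e?)\b")

def role_form(text):
    """Return 'associé', 'associée', or None if neither word is present."""
    # Normalise first so that "e" + combining accent compares equal to "é".
    text = unicodedata.normalize("NFC", text)
    match = ROLE_RE.search(text)
    if match is None:
        return None
    return match.group(0)

print(role_form("associé gérant "))
print(role_form("associée gérante "))
```

The normalisation step is only a precaution in case the page encodes the accent as a separate combining character.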

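For your edit, all you need is to take the text of the last element and split it on whitespace. .split(None, 1)[0] gives the first word ("associé" or "associée"), which is what tells you the gender; [1] would give the rest ("gérant"):

text = table.select('td span[style^="text-decoration:"]', limit=2)[-1].text
gender = text.split(None, 1)[0] # > associé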
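
Once the span text is in hand, mapping it to a gender is a plain string comparison on the first word. A rough sketch, assuming the text looks like the sample in the question ("associé gérant "); the function name gender_from_role and the returned labels are made up for illustration:

```python
def gender_from_role(text):
    """Guess the gender from the first word of a role such as 'associé gérant'."""
    words = text.split()  # split() with no arguments also ignores surrounding whitespace
    if not words:
        return None       # empty text, nothing to decide on
    first = words[0].lower()
    if first == "associée":
        return "female"
    if first == "associé":
        return "male"
    return None           # some other role, so no guess is made

print(gender_from_role("associé gérant "))
```

Returning None for anything unexpected keeps other roles in the table from being silently counted as one gender or the other.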