sravanthi sravanthi - 2 years ago 64
HTML Question

Extracting text between tags using a particular word

I'm trying to extract the text between tags of an HTML page using a keyword. Here is an example.

<div class="xyz">Title</div>
<h4>Education</h4>
<p>PhD, 2017, Subject,<br />
ABC University </p>


I want to fetch PhD, 2017, Subject, ABC University. Here is what I tried:

import re
import requests
from bs4 import BeautifulSoup

r = requests.get(site)
soup = BeautifulSoup(r.content, "lxml")
for elems in soup(text=re.compile('PhD')):
    val = elems.find_parent('p').getText()


This matches every 'p' tag containing PhD. Can someone suggest how to get only the data under the 'Education' heading? I also tried using partition, which didn't work.

Answer Source

You can try using lxml.html to get the desired text:

import requests
import lxml.html as html

source = requests.get(site).content
html_obj = html.fromstring(source)
# Text nodes of the <p> siblings that follow the "Education" heading
texts = html_obj.xpath('//h4[.="Education"]/following-sibling::p/text()')
my_text = " ".join(text.strip() for text in texts)
print(my_text)

Output

PhD, 2017, Subject, ABC University
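
Since the question already uses BeautifulSoup, the same idea can probably be expressed there too: locate the h4 heading first, then step to the p that follows it. The following is only a sketch under a few assumptions: the markup is the sample from the question (a real page would come from requests.get(site).content), the heading text matches exactly, and text_under_heading and SAMPLE_HTML are hypothetical names introduced for illustration. It uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real page would be
# fetched with requests.get(site).content instead.
SAMPLE_HTML = """
<div class="xyz">Title</div>
<h4>Education</h4>
<p>PhD, 2017, Subject,<br />
ABC University </p>
"""


def text_under_heading(markup, heading):
    """Return the text of the first <p> sibling after the <h4> whose
    text equals `heading`, or None if either element is missing."""
    soup = BeautifulSoup(markup, "html.parser")
    h4 = soup.find("h4", string=heading)
    if h4 is None:
        return None
    p = h4.find_next_sibling("p")
    if p is None:
        return None
    # stripped_strings yields each text fragment with surrounding
    # whitespace removed, so the <br /> split becomes a single space.
    return " ".join(p.stripped_strings)


print(text_under_heading(SAMPLE_HTML, "Education"))
```

If several paragraphs follow the heading, find_next_sibling picks only the first one, whereas the XPath in the answer selects every following p sibling; appending [1] to following-sibling::p should restrict it in the same way.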