zondo - 27 days ago
Python Question

Get all text in a tag unless it is in another tag

I'm trying to parse some HTML with BeautifulSoup, and I'd like to get all the text (recursively) in a tag, but I want to ignore any text that appears within a <small> tag. For example, this HTML:

<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>


should give the text "Final definition." Note that this is a minimal example. In the real HTML, many other tags are involved, so the solution should exclude <small> rather than include only <a>.

The text attribute of the tag is close to what I want, but it would include "Fun fact." I could concatenate the text of all child tags except <small>, but that would leave out "definition.", which sits directly inside <li> as plain text rather than inside a child tag. I couldn't find a method like get_text_until (the <small> tag is always at the end), so what can I do?

Answer

You can use find_all() to find all the <small> tags, call clear() on each one (this empties the tag in place, so the parsed tree is modified), then call get_text():

>>> soup

<li>
<a href="/path">
    Final
  </a>
  definition.
  <small>
    Fun fact.
  </small>
</li>

>>> for el in soup.find_all("small"):
...     el.clear()
...
>>> soup

<li>
<a href="/path">
    Final
  </a>
  definition.
  <small></small>
</li>

>>> soup.get_text()
'\n\n\n    Final\n  \n  definition.\n  \n\n'
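
The result above still carries the whitespace from the original markup, and clear() changes the tree. If you need exactly "Final definition." and want to leave the soup untouched, here is one possible sketch. The helper name text_excluding and its excluded parameter are made up for this example, not part of BeautifulSoup. It assumes a bs4 version that accepts find_all(string=True) and uses the built-in "html.parser":

```python
from bs4 import BeautifulSoup

HTML = """<li>
<a href="/path">
    Final
  </a>
  definition.
  <small>
    Fun fact.
  </small>
</li>"""


def text_excluding(tag, excluded="small"):
    """Return the text under `tag`, skipping text nested in `excluded` tags.

    Hypothetical helper: it does not modify the tree.
    """
    # Collect every text node under `tag`, at any depth.
    parts = [
        s for s in tag.find_all(string=True)
        # Keep the node only if none of its ancestors is an excluded tag.
        if s.find_parent(excluded) is None
    ]
    # Join the pieces, then collapse runs of whitespace into single spaces.
    return " ".join("".join(parts).split())


soup = BeautifulSoup(HTML, "html.parser")
result = text_excluding(soup.li)
print(result)
```

Note that find_parent() looks at all ancestors, so if the tag you pass in is itself nested inside a <small>, everything would be skipped. For the HTML in the question that is not a concern.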