SilverWingedSeraph - 11 days ago
Python Question

Get maximum nesting of tags with BeautifulSoup

I'm looking for a way, given a BeautifulSoup-parsed document, to find what the maximum level of nesting is.

E.g., I need magic_function in:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://example.com")
soup = BeautifulSoup(r.text, "html.parser")
depth = magic_function(soup)


For this document, for example, it would return 4:

<html>
  <body>
    <p>
      <strong>Some Text.</strong>
      <strong>Some Text.</strong>
      <strong>Some Text.</strong>
    </p>
    <p>
      <strong>Some Text.</strong>
      <strong>Some Text.</strong>
      <strong>Some Text.</strong>
    </p>
  </body>
</html>


Some ideas I've had:


  1. Is there a function in BeautifulSoup to do this? Looking at docs and Googling has availed me nothing.

  2. Is there another library that would allow me to do this? Again, Googling has availed me nothing, but I may simply not know what to search for.

  3. Should I try just traversing the tree with a function I've built on my own? I'd really rather not, but I could certainly do that.


Answer

Traversing the tree with your own magic_function() isn't difficult. You could use a simple recursive function like:

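def magic_function(soup):
    # Recurse only into nodes that actually have children.
    if hasattr(soup, "contents") and soup.contents:
        # One level for this node plus the deepest chain below it.
        return max(magic_function(child) for child in soup.contents) + 1
    # Anything without children (text nodes, empty tags) ends the chain.
    return 0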

Call the function on the document's top-level html tag (soup.html) rather than on the soup object itself. The soup object wraps the whole document, so passing it in would count that wrapper as one extra nesting level.
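One caveat with the recursive version: Python limits recursion depth (1000 frames by default), so an extremely deeply nested document could raise a RecursionError. Below is a minimal sketch of the same count using an explicit stack instead of recursion. It assumes only that nodes expose .contents, as in the function above; the name max_depth_iterative is made up for illustration.

```python
def max_depth_iterative(root):
    """Count levels the same way as magic_function, without recursion."""
    deepest = 0
    # Each stack entry is (node, number of steps taken down from root).
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        deepest = max(deepest, depth)
        # Text nodes have no children; empty tags have an empty list.
        children = getattr(node, "contents", None)
        if children:
            stack.extend((child, depth + 1) for child in children)
    return deepest
```

It would be called the same way, e.g. max_depth_iterative(soup.html).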

With the document from your question, the call returns 4:

>>> magic_function(soup.html)
4
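As for whether another library could do this: if only tag nesting matters, the standard library's html.parser can track depth while it parses, without building a tree at all. This is a rough sketch rather than a BeautifulSoup feature. DepthParser, max_tag_depth and VOID_TAGS are names invented here, and it assumes reasonably well-formed markup (an unclosed <p>, for instance, would inflate the count).

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DepthParser(HTMLParser):
    """Tracks how many tags are currently open and the maximum seen."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Counts as one level below the current tag, but is never opened.
            self.max_depth = max(self.max_depth, self.depth + 1)
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def max_tag_depth(html):
    parser = DepthParser()
    parser.feed(html)
    parser.close()
    return parser.max_depth
```

For the document in the question this gives 4 (html, body, p, strong). Note that it counts tags only, so an empty element still counts as a level here, whereas the recursive function above returns 0 for a node whose contents are empty.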