amphibient - 2 months ago
Python Question

BeautifulSoup navigation ignores specified path

It appears as though my parser ignores the path of the element I request and returns the first tag that bears the name of the final element in the path, regardless of the path up to that point.




from bs4 import BeautifulSoup

testXML = "<root><firstcategory><subcategory><id>123</id><name>SubcategX</name></subcategory><id>789</id><name>Category1</name></firstcategory></root>"

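# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(testXML, "html.parser")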
#below should be 789
categID =
#this prints 123, which corresponds to the path, not
print("categID = %s" % categID)

Why does BeautifulSoup simply find the first id tag in the hierarchy irrespective of the specified path?


When you use the dot syntax, BeautifulSoup searches all descendants of the tag recursively, not just its direct children. The <id> inside <subcategory> comes first in document order, so that is the one it returns.

To prevent recursion, you can do:

soup.firstcategory.find('id', recursive=False)

The recursive argument is described in the BeautifulSoup documentation for find() and find_all().
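As a rough illustration of the difference, here is a minimal sketch that runs both lookups side by side. The assignment in the question is missing, so the dotted path `soup.root.firstcategory.id.text` is an assumption about what was originally written. The variable names `nested_id` and `direct_id` are made up for this example, and `"html.parser"` is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Same structure as the question's test document, split for readability.
test_xml = (
    "<root><firstcategory>"
    "<subcategory><id>123</id><name>SubcategX</name></subcategory>"
    "<id>789</id><name>Category1</name>"
    "</firstcategory></root>"
)

soup = BeautifulSoup(test_xml, "html.parser")

# Assumed form of the original lookup: each dotted name is shorthand for
# find(), which searches every descendant, so the nested <id> is found first.
nested_id = soup.root.firstcategory.id.text

# recursive=False restricts the search to direct children of <firstcategory>.
direct_id = soup.root.firstcategory.find("id", recursive=False).text

print("nested_id = %s" % nested_id)  # prints: nested_id = 123
print("direct_id = %s" % direct_id)  # prints: direct_id = 789
```

Only the last step needs `recursive=False` here, because `root` and `firstcategory` each occur once in the document, so the dotted lookups for them are unambiguous.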