amphibient - 1 month ago
Python Question

BeautifulSoup navigation ignores specified path

It appears as though my BeautifulSoup parser ignores the path of the element I request and returns the first tag found that bears the name of the final element in the path, regardless of the path up to that point.

XML:

<root>
  <firstcategory>
    <subcategory>
      <id>123</id>
      <name>SubcategX</name>
    </subcategory>
    <id>789</id>
    <name>Category1</name>
  </firstcategory>
</root>


Python code:

from bs4 import BeautifulSoup

testXML = "<root><firstcategory><subcategory><id>123</id><name>SubcategX</name></subcategory><id>789</id><name>Category1</name></firstcategory></root>"

# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(testXML, "html.parser")
# below should be 789
categID = soup.root.firstcategory.id
# this prints <id>123</id>, the tag at root.firstcategory.subcategory.id, not root.firstcategory.id
print("categID = %s" % categID)


Why does BeautifulSoup simply find the first id tag in the hierarchy irrespective of the specified path?

Answer

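When you use the dot syntax, BeautifulSoup searches all descendants of the tag recursively, not just its direct children; tag.id is shorthand for tag.find('id'). The search runs in document order, so it finds the subcategory's <id> first because that tag appears before the <id> that is a direct child of firstcategory.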
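A minimal sketch of that behaviour, assuming the same test document parsed with the built-in "html.parser" (the variable names here are illustrative, not from the question). It checks that attribute access and find() return the same tag and lists the order in which the matching tags are visited:

```python
from bs4 import BeautifulSoup

xml = ("<root><firstcategory><subcategory><id>123</id><name>SubcategX</name>"
       "</subcategory><id>789</id><name>Category1</name></firstcategory></root>")
soup = BeautifulSoup(xml, "html.parser")
first = soup.root.firstcategory

# Attribute access is shorthand for find(), which walks all descendants
# in document order, not only direct children.
via_dot = first.id
via_find = first.find("id")
same_tag = via_dot is via_find

# Every <id> below <firstcategory>, in the order the search visits them.
all_ids = [tag.get_text() for tag in first.find_all("id")]

print(same_tag)
print(all_ids)
```

The nested <id> comes first in the list, which is why the dot syntax returns it.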

To prevent recursion, you can do:

soup.firstcategory.find('id', recursive=False)

See the BeautifulSoup documentation for the recursive argument.
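As a rough illustration of the fix, the sketch below applies recursive=False at each step, so every lookup only considers direct children. It assumes the same test document and the "html.parser" backend; direct_id and nested_id are names made up for this example:

```python
from bs4 import BeautifulSoup

xml = ("<root><firstcategory><subcategory><id>123</id><name>SubcategX</name>"
       "</subcategory><id>789</id><name>Category1</name></firstcategory></root>")
soup = BeautifulSoup(xml, "html.parser")
first = soup.root.firstcategory

# Only direct children of <firstcategory> are considered here,
# so the <id> nested inside <subcategory> is skipped.
direct_id = first.find("id", recursive=False)

# Walking the longer path explicitly, one direct child at a time.
nested_id = (first.find("subcategory", recursive=False)
                  .find("id", recursive=False))

print("category id = %s" % direct_id.string)
print("subcategory id = %s" % nested_id.string)
```

Using .string (or get_text()) prints just the text of the tag rather than the whole <id>...</id> element.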