Matt - 3 months ago
Python Question

Trouble parsing XML with python

I have parsed an XML file with BeautifulSoup in Python and I am having trouble extracting the data out of it. An example of the structure of the XML is below:

<Products page="0" pages="-1" records="27">
  <Product id="ABC001">
    <Name>This product name</Name>
    <Cur>USD</Cur>
    <Tag>Text</Tag>
    <Classes>
      <Class id="USD">
        <ClassCur>USD</ClassCur>
        <Identifier>XYZ123456</Identifier>
      </Class>
    </Classes>
  </Product>
  <Product id="XYZ002">
    <Name>That product name</Name>
    <Cur>EUR</Cur>
    <Tag>More Text</Tag>
    <Classes>
      <Class id="EUR">
        <ClassCur>EUR</ClassCur>
        <Identifier>VDSHG123456</Identifier>
      </Class>
    </Classes>
  </Product>
</Products>


The first thing I have been trying to do, so far without success, is to extract all of the Product and Class ids ("ABC001", "XYZ002", etc.).

What I have tried is

products = soup.find_all("Product")

for p in products:
    print(p.find("name"))  # gets the name tag
    print(p.find("cur"))  # gets the cur tag
    # ...etc


However, I can't figure out how to access id within Product. For example, p.find("product") returns None.

Note that while I am using bs4, I don't have to. I have done a lot of web scraping with Python + bs4 and found it useful for navigating HTML, so I assumed it would be the ideal way to handle XML too.

Answer

id is an attribute of Product, not a child element, so you access it with:

p['id']
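Since the question says bs4 is not a requirement, here is a minimal sketch of the same idea using the standard library's xml.etree.ElementTree instead of bs4 (a swapped-in parser, not the one used in the question). There, an attribute is read with element.get("id"), which plays the same role as p['id'] above. The XML string is a trimmed copy of the sample from the question, and the function name extract_ids is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample XML from the question (some child tags omitted).
XML = """<Products page="0" pages="-1" records="27">
  <Product id="ABC001">
    <Name>This product name</Name>
    <Classes>
      <Class id="USD">
        <Identifier>XYZ123456</Identifier>
      </Class>
    </Classes>
  </Product>
  <Product id="XYZ002">
    <Name>That product name</Name>
    <Classes>
      <Class id="EUR">
        <Identifier>VDSHG123456</Identifier>
      </Class>
    </Classes>
  </Product>
</Products>"""


def extract_ids(xml_text):
    """Hypothetical helper: return a list of (product_id, [class_ids]) tuples."""
    root = ET.fromstring(xml_text)
    result = []
    # Tag names are case-sensitive here, so "Product" must match the XML exactly.
    for product in root.iter("Product"):
        # id is an attribute, so it is read with .get(), not by searching children.
        class_ids = [cls.get("id") for cls in product.iter("Class")]
        result.append((product.get("id"), class_ids))
    return result


for product_id, class_ids in extract_ids(XML):
    print(product_id, class_ids)
```

With the sample above this prints ABC001 ['USD'] and then XYZ002 ['EUR']. Note that .get() returns None for a missing attribute, whereas bs4's p['id'] raises a KeyError.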