Sudar Sudar - 6 months ago 130
Python Question

Parsing XML with namespace in Python via 'ElementTree'

I have the following XML which I want to parse using Python's

ElementTree
:

<rdf:RDF xml:base="http://dbpedia.org/ontology/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns="http://dbpedia.org/ontology/">

<owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
<rdfs:label xml:lang="en">basketball league</rdfs:label>
<rdfs:comment xml:lang="en">
a group of sports teams that compete against each other
in Basketball
</rdfs:comment>
</owl:Class>

</rdf:RDF>


I want to find all
owl:Class
tags and then extract the value of all
rdfs:label
instances inside them. I am using the following code:

tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')


Because of the namespace, I am getting the following error.

SyntaxError: prefix 'owl' not found in prefix map


I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.

Kindly let me know how to change the code to find all the
owl:Class
tags.

Answer

ElementTree is not too smart about namespaces. You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:

namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed

root.findall('owl:Class', namespaces)

Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl}Class instead. You can use the same syntax yourself too of course:

root.findall('{http://www.w3.org/2002/07/owl#}Class')

If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.