Jumpy89 - 5 months ago
Python Question

How can I make my code more readable and DRYer when working with XML namespaces in Python?

Python's built-in xml.etree package supports parsing XML files with namespaces, but namespace prefixes get expanded to the full URI enclosed in curly braces. So in the example file in the official documentation:

<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    ...


The actor tag gets expanded to {http://people.example.com}actor and fictional:character to {http://characters.example.com}character.
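A minimal sketch of that expansion, using the standard library's xml.etree.ElementTree. The document string is a trimmed copy of the example above with a closing tag added so that it parses; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example document, closed so that it is well-formed.
doc = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
</actors>"""

root = ET.fromstring(doc)
actor = root[0]

# The prefixes are gone; every tag carries its namespace URI in braces.
print(actor.tag)  # {http://people.example.com}actor

tags = [child.tag for child in actor]
for tag in tags:
    print(tag)
```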

I can see how this makes everything very explicit and reduces ambiguity (the file could have the same namespace with a different prefix, etc.), but it is very cumbersome to work with. The Element.find() method and others allow passing a dict mapping prefixes to namespace URIs, so I can still do element.find('fictional:character', nsmap), but to my knowledge there is nothing similar for tag attributes. This leads to annoying stuff like element.attrib['{{{}}}attrname'.format(nsmap['prefix'])].
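The asymmetry can be sketched with the standard library as follows. find() accepts the prefix map directly, while an attribute lookup needs the already expanded name. The qname() helper is a hypothetical name introduced here, not part of xml.etree, and the fictional:role attribute is invented for the example.

```python
import xml.etree.ElementTree as ET

nsmap = {
    "people": "http://people.example.com",
    "fictional": "http://characters.example.com",
}

# The fictional:role attribute is made up for this illustration.
doc = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor fictional:role="knight">
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
    </actor>
</actors>"""


def qname(prefixed, nsmap):
    """Expand 'prefix:local' to '{uri}local' (hypothetical helper)."""
    prefix, _, local = prefixed.partition(":")
    return "{{{}}}{}".format(nsmap[prefix], local)


root = ET.fromstring(doc)
actor = root.find("people:actor", nsmap)

# Elements: the prefix map is passed straight to find().
character = actor.find("fictional:character", nsmap)

# Attributes: the key must already be in {uri}local form.
role = actor.attrib[qname("fictional:role", nsmap)]
```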

The popular lxml package provides the same API with a few extensions, one of which is an nsmap property on the elements that they inherit from the document. However, none of the methods seem to actually make use of it, so I still have to do element.find('fictional:character', element.nsmap), which is just unnecessarily repetitive to type out every time. It also still doesn't work with attributes.

Luckily lxml supports subclassing ElementBase, so I just made one with a p (for prefix) property that has the same API but automatically resolves namespace prefixes using the element's nsmap (Edit: likely best to assign a custom nsmap defined in code). So I just do element.p.find('fictional:character') or element.p.attrib['prefix:attrname'], which is much less repetitive and I think way more readable.
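The idea can be sketched without lxml as a small wrapper around a standard-library element. This is not the subclass described above: the class name PrefixView, its methods, and the fixed NSMAP are all assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Prefix map defined in code rather than taken from the document.
NSMAP = {
    "people": "http://people.example.com",
    "fictional": "http://characters.example.com",
}


class PrefixView:
    """Hypothetical prefix-aware view of an Element."""

    def __init__(self, element, nsmap=NSMAP):
        self.element = element
        self.nsmap = nsmap

    def _expand(self, name):
        # 'prefix:local' -> '{uri}local'; unprefixed names pass through.
        prefix, sep, local = name.partition(":")
        if not sep:
            return name
        return "{{{}}}{}".format(self.nsmap[prefix], local)

    def find(self, path):
        found = self.element.find(path, self.nsmap)
        return None if found is None else PrefixView(found, self.nsmap)

    def findall(self, path):
        return [PrefixView(e, self.nsmap)
                for e in self.element.findall(path, self.nsmap)]

    def get(self, name, default=None):
        return self.element.get(self._expand(name), default)

    @property
    def text(self):
        return self.element.text


# The document uses the prefix 'f'; the code uses 'fictional'.
# Only the namespace URI has to agree.
doc = """<actors xmlns:f="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor f:role="knight">
        <f:character>Lancelot</f:character>
    </actor>
</actors>"""

actor = PrefixView(ET.fromstring(doc)).find("people:actor")
names = [c.text for c in actor.findall("fictional:character")]
role = actor.get("fictional:role")
```

Keeping the prefix map in code, as the edit in the question suggests, means the wrapper does not depend on whichever prefixes a particular file happens to declare.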

I just feel like I'm missing something, though; it really feels like this should already be a feature of lxml, if not of the built-in etree package. Am I somehow doing this wrong?

Answer

Is it possible to get rid of the namespace mapping?

Do you need to pass it as a parameter into each function call? An option would be to store the prefixes to use in a property of the XML document.

That's fine until you pass the XML document into a third-party function. That function wants to use prefixes as well, so it sets the property to something else, because it does not know what you set it to.

When you get the XML document back, it has been modified, so your prefixes don't work any more.

All in all: no, it's not safe, and therefore the design is good as it is.
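The hazard can be made concrete with a toy model. Nothing below is a real library API: SharedDoc, its prefixes attribute, and third_party() are hypothetical names that only mimic a document carrying one mutable prefix table.

```python
class SharedDoc:
    """Toy document that stores a single, mutable prefix table."""

    def __init__(self):
        self.prefixes = {}

    def resolve(self, name):
        # 'prefix:local' -> '{uri}local' using the document-wide table.
        prefix, _, local = name.partition(":")
        return "{{{}}}{}".format(self.prefixes[prefix], local)


def third_party(doc):
    # The callee cannot know the caller's table, so it installs its own.
    doc.prefixes = {"c": "http://characters.example.com"}
    return doc.resolve("c:character")


doc = SharedDoc()
doc.prefixes = {"fictional": "http://characters.example.com"}
before = doc.resolve("fictional:character")

third_party(doc)

try:
    doc.resolve("fictional:character")
    still_works = True
except KeyError:
    # The caller's prefix was replaced behind its back.
    still_works = False
```

Passing the map with each call keeps every caller's prefixes local to that call, which is what the existing API does.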

This design is not unique to Python; it also exists in .NET. SelectNodes() [MSDN] can be used if you don't need prefixes, but as soon as a prefix is present, it throws an exception. In that case you have to use the overloaded SelectNodes() [MSDN], which takes an XmlNamespaceManager as a parameter.

XPath as a solution

I suggest learning XPath, which lxml supports and which lets you use prefixes. Since this may be version-specific, let me say I ran this code with Python 2.7 x64 and lxml 3.6.0 (I'm not too familiar with Python, so this may not be the cleanest code, but it serves well as a demonstration):

from lxml import etree as ET
from pprint import pprint
data = """<?xml version="1.0"?>
<d:data xmlns:d="dns">
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor d:name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</d:data>"""
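root = ET.fromstring(data)
# The query prefix 'x' need not match the document's 'd'; only the URI must agree.
my_namespaces = {'x': 'dns'}
# Namespaced name attribute, reached by an absolute path
xp = root.xpath("/x:data/country/neighbor/@x:name", namespaces=my_namespaces)
pprint(xp)
# The same attribute, found anywhere in the document
xp = root.xpath("//@x:name", namespaces=my_namespaces)
pprint(xp)
# Un-namespaced name attributes of the neighbor elements
xp = root.xpath("/x:data/country/neighbor/@name", namespaces=my_namespaces)
pprint(xp)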

The output is

C:\Python27x64\python.exe E:/xpath.py
['Austria']
['Austria']
['Switzerland', 'Malaysia']

Process finished with exit code 0

Note how XPath resolved the mapping from the x prefix in the namespace table to the d prefix in the XML document: the two are matched by namespace URI, not by prefix name.

This eliminates the hard-to-read element.attrib['{{{}}}attrname'.format(nsmap['prefix'])].

Short (and incomplete) XPath introduction

To select an element, write /element, optionally with a prefix.

xp=root.xpath("/x:data", namespaces=my_namespaces)

To select an attribute, write /@attribute, optionally with a prefix.

#See example above

To navigate down, concatenate several elements. Use // if you don't know the items in between. To move up, use /.. (slash followed by two dots). An attribute must come last in the path unless it is followed by /.. to step back up to its element.

xp=root.xpath("/x:data/country/neighbor/@x:name/..", namespaces=my_namespaces)

To use a condition, write it in square brackets. /element[@attribute] means: select all elements that have this attribute. /element[@attribute='value'] means: select all elements that have this attribute and the attribute has a specific value. /element[./subelement] means: select all elements that have a subelement with a specific name. Optionally use prefixes anywhere.

xp=root.xpath("/x:data/country[./neighbor[@name='Switzerland']]/@name", namespaces=my_namespaces)

There's much more to discover, like text(), various ways of sibling selection and even functions.
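For comparison, the standard library's xml.etree.ElementTree supports only a small subset of XPath. It has no attribute steps such as /@name, but //, .. and simple predicates such as [@attribute='value'] work in find() and findall(). A sketch under those constraints, reusing a shortened version of the document above (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

data = """<d:data xmlns:d="dns">
    <country name="Liechtenstein">
        <neighbor d:name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <neighbor name="Malaysia" direction="N"/>
    </country>
</d:data>"""

root = ET.fromstring(data)

# Predicate on an un-namespaced attribute, then a step up to the parent.
countries = root.findall(".//neighbor[@name='Switzerland']/..")
country_names = [c.get("name") for c in countries]

# Attribute values are read from the elements, not selected by the path.
neighbors = list(root.iter("neighbor"))
plain_names = [n.get("name") for n in neighbors
               if n.get("name") is not None]

# A namespaced attribute needs its expanded {uri}local key.
ns_names = [n.get("{dns}name") for n in neighbors
            if n.get("{dns}name") is not None]
```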

About the 'why'

The original question title was

Why does working with XML namespaces seem so difficult in Python?

Some users simply don't understand the concept. If the user does understand it, maybe the library developer didn't. Or perhaps it was just one option out of many, and the decision was to go in that direction. The only person who could answer the "why" part in such a case would be the developer himself.
