Rappster - 2 months ago
R Question

XPath and namespace specification for XML documents with an explicit default namespace

I'm struggling to find the correct combination of an XPath expression and the namespace specification required by package XML (argument namespaces) for an XML document that has an explicit xmlns namespace defined at the top element.

UPDATE



Thanks to har07 I was able to put it together:

Once you query the namespaces, the first entry of ns has no name yet, and that's the problem:

nsDefs <- xmlNamespaceDefinitions(doc)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))

> ns
                                             omegahat                          r
    "http://something.org"  "http://www.omegahat.org" "http://www.r-project.org"


So we'll just assign a name that serves as a prefix (this can be any valid R name):

names(ns)[1] <- "xmlns"


Now all we have to do is use that default namespace prefix on every unprefixed element in our XPath expressions:

getNodeSet(doc, "/xmlns:doc//xmlns:b[@omegahat:status='foo']", ns)
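
The steps above can be put together in one self-contained sketch. It assumes the XML package is installed, parses the document from a string rather than from a file, and uses "d" as the prefix for the default namespace; that prefix is an arbitrary choice for illustration, just as "xmlns" is above.

```r
library(XML)

# Same structure as the document with the explicit default namespace,
# parsed from a string instead of a file.
xml_source <- '<?xml version="1.0"?>
<doc xmlns="http://something.org"
     xmlns:omegahat="http://www.omegahat.org"
     xmlns:r="http://www.r-project.org">
  <a>
    <b><c><b/></c></b>
    <b omegahat:status="foo">
      <r:d><a status="xyz"/><a/><a status="1"/></r:d>
    </b>
  </a>
</doc>'
doc <- xmlParse(xml_source, asText = TRUE)

# Collect the prefix -> URI pairs declared on the root element.
nsDefs <- xmlNamespaceDefinitions(doc)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))

# The default namespace comes back without a name; give it a prefix.
# "d" is an arbitrary choice for this sketch.
unnamed <- is.na(names(ns)) | names(ns) == ""
names(ns)[unnamed] <- "d"

# Every element in the default namespace needs that prefix in the XPath.
hits <- getNodeSet(doc, "/d:doc//d:b[@omegahat:status='foo']", namespaces = ns)
length(hits)
xmlName(hits[[1]])
```

Looking up the unnamed entry instead of hard-coding position 1 keeps the sketch working if the declarations appear in a different order.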


Those interested in alternative solutions based on name() and namespace-uri() (amongst others) might find this post helpful.
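
As a rough illustration of that alternative, the sketch below matches elements by local name and namespace URI, so no prefix has to be invented for the default namespace. It uses local-name(), which I am assuming is one of the "others" meant above, and a shortened version of the example document. The prefixed attribute still requires omegahat to be registered.

```r
library(XML)

# Shortened version of the example document (an assumption of this sketch).
xml_source <- '<doc xmlns="http://something.org"
     xmlns:omegahat="http://www.omegahat.org">
  <a><b/><b omegahat:status="foo"/></a>
</doc>'
doc <- xmlParse(xml_source, asText = TRUE)

# Only the attribute prefix is registered; the default namespace gets none.
ns <- c(omegahat = "http://www.omegahat.org")

# Match any element whose local name is "b" and whose namespace URI is
# the default namespace of the document.
xp <- "//*[local-name()='b' and namespace-uri()='http://something.org']"
all_b <- getNodeSet(doc, xp, namespaces = ns)

# Narrow the match down with the prefixed attribute.
foo_b <- getNodeSet(doc, paste0(xp, "[@omegahat:status='foo']"), namespaces = ns)

length(all_b)
length(foo_b)
```

The expressions get verbose quickly, which is why naming the default namespace is usually the more readable route.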




Just for the sake of reference: this was the trial-and-error code before we came to the solution:

Consider the example from ?xmlParse:

require("XML")

doc <- xmlParse(system.file("exampleData", "tagnames.xml", package = "XML"))

> doc
<?xml version="1.0"?>
<doc>
<!-- A comment -->
<a xmlns:omegahat="http://www.omegahat.org" xmlns:r="http://www.r-project.org">
<b>
<c>
<b/>
</c>
</b>
<b omegahat:status="foo">
<r:d>
<a status="xyz"/>
<a/>
<a status="1"/>
</r:d>
</b>
</a>
</doc>

nsDefs <- xmlNamespaceDefinitions(getNodeSet(doc, "/doc/a")[[1]])
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))
getNodeSet(doc, "/doc//b[@omegahat:status='foo']", ns)[[1]]


In my document, however, the namespaces are already defined in the <doc> tag, so I adapted the example XML code accordingly:

xml_source <- c(
"<?xml version=\"1.0\"?>",
"<doc xmlns:omegahat=\"http://www.omegahat.org\" xmlns:r=\"http://www.r-project.org\">",
"<!-- A comment -->",
"<a>",
"<b>",
"<c>",
"<b/>",
"</c>",
"</b>",
"<b omegahat:status=\"foo\">",
"<r:d>",
"<a status=\"xyz\"/>",
"<a/>",
"<a status=\"1\"/>",
"</r:d>",
"</b>",
"</a>",
"</doc>"
)
write(xml_source, file="exampleData_2.xml")
doc <- xmlParse("exampleData_2.xml")
nsDefs <- xmlNamespaceDefinitions(doc)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))
getNodeSet(doc, "/doc", namespaces = ns)
getNodeSet(doc, "/doc//b[@omegahat:status='foo']", namespaces = ns)[[1]]


Everything still works fine. My XML document, however, additionally has an explicit definition of the default namespace (xmlns):

xml_source <- c(
"<?xml version=\"1.0\"?>",
"<doc xmlns=\"http://something.org\" xmlns:omegahat=\"http://www.omegahat.org\" xmlns:r=\"http://www.r-project.org\">",
"<!-- A comment -->",
"<a>",
"<b>",
"<c>",
"<b/>",
"</c>",
"</b>",
"<b omegahat:status=\"foo\">",
"<r:d>",
"<a status=\"xyz\"/>",
"<a/>",
"<a status=\"1\"/>",
"</r:d>",
"</b>",
"</a>",
"</doc>"
)
write(xml_source, file="exampleData_3.xml")
doc <- xmlParse("exampleData_3.xml")
nsDefs <- xmlNamespaceDefinitions(doc)
ns <- structure(sapply(nsDefs, function(x) x$uri), names = names(nsDefs))


What used to work fails now:

> getNodeSet(doc, "/doc", namespaces = ns)
list()
attr(,"class")
[1] "XMLNodeSet"
Warning message:
using http://something.org as prefix for default namespace http://something.org

> getNodeSet(doc, "/xmlns:doc", namespaces = ns)
XPath error : Undefined namespace prefix
XPath error : Invalid expression
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces, :
error evaluating xpath expression /xmlns:doc
In addition: Warning message:
using http://something.org as prefix for default namespace http://something.org

getNodeSet(doc, "/xmlns:doc",
  namespaces = matchNamespaces(doc, namespaces="xmlns", nsDefs = nsDefs)
)


This seems to get me closer:

> getNodeSet(doc, "/xmlns:doc",
+ namespaces = matchNamespaces(doc, namespaces="xmlns", nsDefs = nsDefs)
+ )[[1]]
<doc xmlns="http://something.org" xmlns:omegahat="http://www.omegahat.org" xmlns:r="http://www.r-project.org">
<!-- A comment -->
<a>
<b>
<c>
<b/>
</c>
</b>
<b omegahat:status="foo">
<r:d>
<a status="xyz"/>
<a/>
<a status="1"/>
</r:d>
</b>
</a>
</doc>

attr(,"class")
[1] "XMLNodeSet"


Yet now I don't know how to proceed in order to get to the child nodes:

> getNodeSet(doc, "/xmlns:doc//b[@omegahat:status='foo']", ns)[[1]]
XPath error : Undefined namespace prefix
XPath error : Invalid expression
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces, :
error evaluating xpath expression /xmlns:doc//b[@omegahat:status='foo']
In addition: Warning message:
using http://something.org as prefix for default namespace http://something.org

> getNodeSet(doc, "/xmlns:doc//b[@omegahat:status='foo']",
+ namespaces = c(
+ matchNamespaces(doc, namespaces="xmlns", nsDefs = nsDefs),
+ matchNamespaces(doc, namespaces="omegahat", nsDefs = nsDefs)
+ )
+ )
list()
attr(,"class")
[1] "XMLNodeSet"

Answer

A namespace declaration without a prefix (xmlns="...") defines the default namespace. The element on which it is declared, and every descendant that has no prefix and is not covered by a different default namespace declaration, belongs to that default namespace.

XPath 1.0, on the other hand, treats an unprefixed element name as belonging to no namespace. Therefore, in your case, every element in the XPath that belongs to the default namespace needs the prefix you registered for it, for example:

/xmlns:doc//xmlns:b[@omegahat:status='foo']
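
The scoping rule can be checked with a small sketch in R. It assumes the XML package, a shortened version of the document from the question, and "d" as an arbitrary prefix for the default namespace.

```r
library(XML)

# Shortened version of the document from the question (an assumption).
xml_source <- '<doc xmlns="http://something.org"
     xmlns:r="http://www.r-project.org">
  <a>
    <b><r:d><a status="xyz"/></r:d></b>
  </a>
</doc>'
doc <- xmlParse(xml_source, asText = TRUE)

# "d" is an arbitrary prefix chosen for the default namespace.
ns <- c(d = "http://something.org", r = "http://www.r-project.org")

# Unprefixed elements at any depth are in the default namespace,
# including the <a> nested inside <r:d>.
in_default <- unname(xpathSApply(doc, "//d:*", xmlName, namespaces = ns))

# <r:d> carries its own prefix, so it is not matched by d:*.
in_r <- unname(xpathSApply(doc, "//r:*", xmlName, namespaces = ns))

# Unprefixed attributes are in no namespace, so @status needs no prefix.
xyz <- getNodeSet(doc, "//d:a[@status='xyz']", namespaces = ns)

in_default
in_r
length(xyz)
```

Note that the default namespace applies to elements only; an unprefixed attribute such as status is queried without any prefix.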

UPDATE:

I'm not an R user myself, but judging from some references online, something like this may work:

getNodeSet(doc, "/ns:doc//ns:b[@omegahat:status='foo']", c(ns="http://something.org"))