mjd876 mjd876 - 2 months ago 7
R Question

R-XML pulling nodes into matrix/DF accounting for missing nodes

I am fairly new to using R and very new to using the XML package and xpath. I need to pull four elements from an xml file that looks like this (except that I have trimmed off a lot of other xmlnodes to simplify it here):

<?xml version="1.0" encoding="utf-8"?>
<iati-activities version="1.03" generated-datetime="2015-07-07T16:49:09+00:00">
<iati-activity last-updated-datetime="2014-08-11T14:36:59+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100530</iati-identifier>
<title>Improvement of basic health care</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<participating-org role="Funding" ref="EU" type="15">EU</participating-org>
<participating-org role="Funding" type="21">Cordaid Memisa</participating-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="21">CORDAID RCA</participating-org>
<recipient-country percentage="100" code="CF">CENTRAL AFRICAN REPUBLIC</recipient-country>
<budget type="1">
<period-start iso-date="2010-01-01"></period-start>
<period-end iso-date="2013-02-28"></period-end>
</budget>
</iati-activity>
<iati-activity last-updated-datetime="2013-07-19T14:12:14+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100625</iati-identifier>
<title>Pigs for Pencils</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Funding" type="60">Stichting Kapatiran</participating-org>
<participating-org role="Implementing" type="22">PREDA Foundation Inc.</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
<period-start iso-date="2010-04-20"></period-start>
<period-end iso-date="2012-10-02"></period-end>
<value value-date="2010-04-20">12500</value>
</budget>
</iati-activity>
<iati-activity last-updated-datetime="2015-04-08T03:01:58+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100815</iati-identifier>
<title>Job and housing opportunities for women </title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="22">WISE</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
<period-start iso-date="2010-10-01"></period-start>
<period-end iso-date="2011-12-31"></period-end>
<value value-date="2010-10-01">227000</value>
</budget>
</iati-activity>
</iati-activities>


Also this is my first question ever on StackOverflow, so apologies if I'm not doing it correctly (and bc that xml is not perfectly aligned).
The elements I need, and what I'm assigning them to are:

UniqueID <- "//iati-activity/iati-identifier"


GrantTitle <- "//iati-activity/title"


GrantAmount <- "//iati-activity/budget/value"


Recipient <- "//iati-activity/participatingorg[@role='Implementing']"


So far (after much trial and tribulation) I have come up with this code, that goes through the current node (x), pulling the 4 variables, and cbinding them into a row, then using xpathApply to loop through iati-activity nodes calling the function and rbinding the resulting rows together.

This code works when all four elements exist in each activity. However, note the absence of the budget/value node from the xml sample. This is because I removed it in order to solve this problem of missing nodes, which occur frequently in the full file for almost all the elements I need.

Also note the [1] at the end of my xpath expressions- I've included these because there are also multiple titles, multiple participating-orgs of all types, etc.

Given the multiples of some elements and the nonexistence of others, it makes it impossible to simple pull all the same elements into a vector and pop it into a data frame. Thus the need to loop through each activity pulling the elements with it. My code currently doesn't work to account for missing elements (the missing budget/value in the first iati-activity) because cbinding (and rbinding) ignore null vectors.

xmltestNA = xmlInternalTreeParse("XMLtoDF_TestNA.xml", useInternalNodes=TRUE)
bodyToDF <- function(x){
UniqueID <- xpathSApply(x, "./iati-identifier", xmlValue)
GrantTitle <- xpathSApply(x, "./title[1]", xmlValue)
GrantAmount <- xpathSApply(x, "./budget/value[1]", xmlValue)
Recipient <- xpathSApply(x, "./participating-org[@role='Implementing'][1]", xmlValue)
cbind(UniqueID=UniqueID, GrantTitle=GrantTitle, GrantAmount=GrantAmount, Recipient=Recipient)
}
res <-xpathApply(xmltestNA, '//iati-activity', fun=bodyToDF)
IatiNA <-do.call(rbind, res)
IatiNA


How can I keep the null values/missing nodes in order to turn it into a matrix or dataframe that looks like this:

UniqueID GrantTitle GrantAmount Recipient
1 NL-KVK-41160054-100530 Improvement of basic health care NA CORDAID RCA
2 NL-KVK-41160054-100625 Pigs for Pencils 12500 PREDA Foundation Inc.
3 NL-KVK-41160054-100815 Job and housing opportunities for women 227000 WISE


Because I'm still new, the simpler the code, the better. Thanks in advance!

Answer

If your xpath queries return too many or few results, I think it's easier to work with the nodes

doc <- xmlParse( '<your xml here>')
nodes<- getNodeSet(doc, "//iati-activity")

#Compare
xpathSApply(doc, "//budget/value", xmlValue)
xpathSApply(doc, "//participating-org[@role='Funding']", xmlValue)

sapply(nodes, function(x) xpathSApply(x, "./budget/value", xmlValue))
sapply(nodes, function(x) xpathSApply(x, "./participating-org[@role='Funding']", xmlValue))

Add a function to handle missing or multiple nodes and then create the data.frame

xpath2 <-function(x, path, fun = xmlValue, ...){
   y <- xpathSApply(x, path, fun, ...)
   ifelse(length(y) == 0, NA,
    ifelse(length(y) > 1, paste(y, collapse=", "), y))
}

GrantAmount <- sapply(nodes, xpath2("./budget/value")
UniqueID  <- sapply(nodes, xpath2("./iati-identifier")
GrantTitle <- sapply(nodes, xpath2("./title")
Recipient <-  sapply(nodes, xpath2("./participating-org[@role='Implementing']")
## updated xpath2 so xmlGetAttr will also work
Funding_ref <- sapply(nodes, xpath2("./participating-org[@role='Funding']", xmlGetAttr, "ref")

data.frame(UniqueID, GrantTitle, GrantAmount, Recipient)
                UniqueID                               GrantTitle GrantAmount             Recipient
1 NL-KVK-41160054-100530         Improvement of basic health care        <NA>           CORDAID RCA
2 NL-KVK-41160054-100625                         Pigs for Pencils       12500 PREDA Foundation Inc.
3 NL-KVK-41160054-100815 Job and housing opportunities for women       227000                  WISE
Comments