R Question

XML Processing in R: Use xmlGetAttr on child nodes

I have several XML files with a structure similar to the following one:

<?xml version='1.0' encoding='UTF-8'?>

<text>

<stage></stage>

<div>
<intro agent= "Peter"></intro>
<dialogue agent= "Peter"></dialogue>
<outro agent= "Stephen"></outro>
</div>

<div>
<intro agent= "Sandra"></intro>
<dialogue agent= "Peter"></dialogue>
<outro agent= "Robert"></outro>
</div>

<stage></stage>

</text>


My goal is to get a list of all "agents". I came up with

agents <- xmlApply(xml_processed[["test.xml"]], xmlGetAttr, "agent", default= "-")


but this only gives me the values when the attribute sits on the "div" node itself, not on its children. xml_processed is the return value of this function:

# preprocess XML

library(XML)

preprocess_xml <- function() {
  path <- "data/XML"
  # list.files() expects a regular expression, not a glob
  xmlfiles <- list.files(path, pattern = "\\.xml$")
  xmlfiles_path <- file.path(path, xmlfiles)

  # parse each file, keyed by file name
  xmlcontent <- list()
  for (i in seq_along(xmlfiles)) {
    xmlcontent[[xmlfiles[i]]] <- xmlTreeParse(xmlfiles_path[i])
  }

  # keep only the root node of each parsed document
  xmlfinal <- list()
  for (i in seq_along(xmlcontent)) {
    xmlfinal[[xmlfiles[i]]] <- xmlRoot(xmlcontent[[i]])
  }
  return(xmlfinal)
}


I also tried

agents <- xmlApply(xml_processed[["test.xml"]], "/text/div/intro", xmlGetAttr, "agent", default= "-")


to get the agent of the intro node, but this only produces an error:

get(as.character(FUN), mode = "function", envir = envir)

Answer

Methinks it's time to focus more on XPath than R:

txt <- '<?xml version="1.0" encoding="UTF-8"?>
<text>
  <stage></stage>
  <div>
    <intro agent="Peter"></intro>
    <dialogue agent="Peter"></dialogue>
    <outro agent="Stephen"></outro>
  </div>
  <div>
    <intro agent="Sandra"></intro>
    <dialogue agent="Peter"></dialogue>
    <outro agent="Robert"></outro>
  </div>
  <stage></stage>
</text>'

library(xml2)
library(magrittr)

doc <- read_xml(txt)
xml_find_all(doc, ".//*[@agent]") %>% 
  xml_attr("agent")
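
The question also asks for the agent of the intro nodes only. Here is a minimal sketch of how the XPath could be narrowed for that, assuming the same document structure as the sample. The names `intro_agents` and `all_agents` are only illustrative and do not come from the original post:

```r
library(xml2)

# Same structure as the sample document in the question
txt <- '<?xml version="1.0" encoding="UTF-8"?>
<text>
  <stage></stage>
  <div>
    <intro agent="Peter"></intro>
    <dialogue agent="Peter"></dialogue>
    <outro agent="Stephen"></outro>
  </div>
  <div>
    <intro agent="Sandra"></intro>
    <dialogue agent="Peter"></dialogue>
    <outro agent="Robert"></outro>
  </div>
  <stage></stage>
</text>'

doc <- read_xml(txt)

# Absolute path: only <intro> elements directly under /text/div
intro_agents <- xml_attr(xml_find_all(doc, "/text/div/intro"), "agent")

# Every agent attribute anywhere in the document, de-duplicated
all_agents <- unique(xml_attr(xml_find_all(doc, "//*[@agent]"), "agent"))
```

For the sample document, `intro_agents` should hold "Peter" and "Sandra", and `all_agents` the four distinct names in document order.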

If you must use the XML package:

library(XML)

doc <- xmlParse(txt)
xpathSApply(doc, "//*[@agent]", xmlGetAttr, "agent")
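
To run this over a whole directory as in the question, the files would need to be parsed with `xmlParse()` (or `xmlTreeParse(..., useInternalNodes = TRUE)`). The XPath functions in the XML package operate on internal documents, not on the R-level tree that plain `xmlTreeParse()` returns. This is a rough sketch, assuming the `data/XML` directory from the question. `get_agents` is a hypothetical helper name, not something from the original post:

```r
library(XML)

# Hypothetical helper: returns a named list with one character vector
# of agent values per XML file found in `path`.
get_agents <- function(path = "data/XML") {
  # list.files() takes a regular expression, hence "\\.xml$"
  files <- list.files(path, pattern = "\\.xml$", full.names = TRUE)

  agents <- lapply(files, function(f) {
    doc <- xmlParse(f)  # internal document, usable with XPath
    # as.character() turns the empty list returned for "no match"
    # into character(0)
    as.character(xpathSApply(doc, "//*[@agent]", xmlGetAttr, "agent"))
  })

  names(agents) <- basename(files)
  agents
}
```

With this, `get_agents()[["test.xml"]]` would play the role of the `agents` vector the question was after. As for the original attempts: `xmlApply(X, FUN, ...)` only visits the immediate children of `X`, and it treats its second argument as the function to apply, so passing an XPath string there makes R look for a function of that name, which is where the `get(as.character(FUN), mode = "function", ...)` error comes from.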