cmvdi01 cmvdi01 - 2 months ago 10
R Question

Dealing with non-existing nodes in R xml to data frame

I have a case very similar to this one (Load XML to Dataframe in R with parent node attributes): I'm trying to convert XML to a data frame, but I can't handle the scenes where the nodes "sp" and "l" don't exist. (I don't care about node "m".)
Suppose my xml looks like this:

<text>
<body>
<div1 type="scene1” n="1">
<sp who="fau">
<l c="30" a="Settle thy studies"/>
<m x="40" b="To sound the depth of that thou wilt profess"/>
</sp>
<sp who="eang">
<m x="105" b="Go forward, Faustus, in that famous art"/>
</sp>
</div1>
<div1 type="scene2” n="2">
<sp who="fau">
<l c="31" a="Settle thy"/>
<m x="50" b="To sound the depth of"/>
</sp>
<sp who="fau">
<l c="32" a="Settle"/>
<m x="60" b="To sound the"/>
</sp>
<sp who="fau">
<l c="33" a="Settle thy studies, Faustus"/>
<m x="40" b="To sound the depth of that thou wilt"/>
</sp>
</div1>
<div1 type="scene3” n="3">
</div1>
<div1 type="scene4” n="4">
</div1>
<div1 type="scene5” n="5">
</div1>
</body>
</text>


This is what I would like to obtain:

n  type    lc  la
1  scene1  30  Settle thy studies
2  scene2  31  Settle thy
2  scene2  32  Settle
2  scene2  33  Settle thy studies, Faustus
3  scene3  NA  NA
4  scene4  NA  NA
5  scene5  NA  NA


I’ve tried this:

doc = xmlTreeParse("play.xml", useInternal = TRUE)

bodyToDF <- function(x){
n <- xmlGetAttr(x, "n")
type <- xmlGetAttr(x, "type")
sp <- xpathApply(x, 'sp', function(sp) {
if(is.null(sp)) {
lc <- NA
la <- NA
}
lc <- xpathSApply(sp, 'l', function(l) { xmlGetAttr(l,"c")})
la = xpathSApply(sp, 'l', function(l) { xmlValue(l,"a")})
data.frame(n, type, lc, la)
})
do.call(rbind, sp)
}


res <- xpathApply(doc, '//div1', bodyToDF)


but it doesn’t work:

Error in data.frame(n, type, lc, la) :
arguments imply differing number of rows: 1, 0


and also this:

div1 = sapply(c("n","type"), function(x) xpathSApply(doc, "//div1", xmlGetAttr, x), simplify=FALSE)

l = sapply(c("c","a"), function(x) xpathSApply(doc, "//l", xmlGetAttr, x), simplify=FALSE)

df <- data.frame(div1,l)


but I can’t seem to get the correct match between the nodes and df rows:

Error in data.frame(div1, l) :
arguments imply differing number of rows: 5, 4


Any ideas? Thank you.

Answer

Your pasted XML text has issues (some double quotes aren't plain double quotes) so here's a good version of it for others:

txt <- '<text>
    <body>
        <div1 type="scene1" n="1">
            <sp who="fau">
                <l c="30" a="Settle thy studies"/>
                <m x="40" b="To sound the depth of that thou wilt profess"/>
            </sp>
            <sp who="eang">
                <m x="105" b="Go forward, Faustus, in that famous art"/>
            </sp>
        </div1>
        <div1 type="scene2" n="2">
            <sp who="fau">
                <l c="31" a="Settle thy"/>
                <m x="50" b="To sound the depth of"/>
            </sp>
            <sp who="fau">
                <l c="32" a="Settle"/>
                <m x="60" b="To sound the"/>
            </sp>
            <sp who="fau">
                <l c="33" a="Settle thy studies, Faustus"/>
                <m x="40" b="To sound the depth of that thou wilt"/>
            </sp>
        </div1>
        <div1 type="scene3" n="3"></div1>
        <div1 type="scene4" n="4"></div1>
        <div1 type="scene5" n="5"></div1>
    </body>
</text>'

The following uses xml2, but it can be translated back to the XML package's syntax if truly necessary. The idea is similar to other answers: inspect each "scene" node and explicitly handle the case where it has no "l" nodes, returning a row of NAs instead:

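library(xml2)
library(purrr)
library(dplyr)

doc <- read_xml(txt)

# visit every scene node; each one contributes at least one row
xml_find_all(doc, ".//*[contains(@type, 'scene')]") %>% 
  map_df(function(x) {

    scene <- xml_attr(x, "type")
    num <- xml_attr(x, "n")

    # all <l> descendants of this scene (may be empty)
    lines <- xml_find_all(x, ".//l")

    if (length(lines) == 0) {
      # no <l> nodes: emit a single placeholder row
      # (tibble() replaces the deprecated data_frame())
      tibble(n=num, scene=scene, lc=NA_character_, la=NA_character_)
    } else {
      # one row per <l> node
      map_df(lines, function(y) {
        lc <- xml_attr(y, "c") %||% NA
        la <- xml_attr(y, "a") %||% NA
        tibble(n=num, scene=scene, lc=lc, la=la)
      })
    }

  })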

That gives you your desired output (the "type" column is named "scene" here):

## # A tibble: 7 × 4
##       n  scene    lc                          la
##   <chr>  <chr> <chr>                       <chr>
## 1     1 scene1    30          Settle thy studies
## 2     2 scene2    31                  Settle thy
## 3     2 scene2    32                      Settle
## 4     2 scene2    33 Settle thy studies, Faustus
## 5     3 scene3  <NA>                        <NA>
## 6     4 scene4  <NA>                        <NA>
## 7     5 scene5  <NA>                        <NA>
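
If you would rather stay with the XML package, the same idea might look roughly like the sketch below. Treat it as an illustration, not a drop-in: the helper name sceneToDF and the shortened sample string are made up for this example, and it assumes the XML has plain double quotes as in the cleaned-up version above.

```r
library(XML)

# shortened sample (an assumption for illustration): one scene with an <l>, one empty scene
txt <- '<text><body>
  <div1 type="scene1" n="1">
    <sp who="fau"><l c="30" a="Settle thy studies"/></sp>
    <sp who="eang"><m x="105" b="Go forward, Faustus"/></sp>
  </div1>
  <div1 type="scene3" n="3"></div1>
</body></text>'

doc <- xmlParse(txt, asText = TRUE)

# hypothetical helper: turns one <div1> node into a data frame
sceneToDF <- function(x) {
  n <- xmlGetAttr(x, "n")
  type <- xmlGetAttr(x, "type")

  # relative XPath, so only <l> nodes under this <div1> are matched
  lines <- getNodeSet(x, ".//l")

  if (length(lines) == 0) {
    # no <l> nodes: return one placeholder row instead of zero-length columns
    return(data.frame(n = n, type = type,
                      lc = NA_character_, la = NA_character_,
                      stringsAsFactors = FALSE))
  }

  # one row per <l>; the third argument is xmlGetAttr's default for a missing attribute
  data.frame(n = n, type = type,
             lc = vapply(lines, xmlGetAttr, character(1), "c", NA_character_),
             la = vapply(lines, xmlGetAttr, character(1), "a", NA_character_),
             stringsAsFactors = FALSE)
}

res <- do.call(rbind, xpathApply(doc, "//div1", sceneToDF))
```

The key difference from the attempt in the question is the explicit length check: data.frame() cannot recycle a zero-length column against the length-one n and type, which is where the "differing number of rows: 1, 0" error comes from.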