tubelius - 2 months ago
R Question

In R how do I pair XML node values from common parent nodes?

I have the following example XML:

<body>
<div class="row">
<div class="column">
<span class="title">Color</span>
</div>
<div class="column property">Blue</div>
</div>
<div class="row">
<div class="column">
<span class="title">Shape</span>
</div>
<div class="column property">Square</div>
</div>
</body>


How can I use R to pair each title with its property and output:

Color = Blue
Shape = Square


I tried the following script, but the title comes back wrapped in XML tags and the property is missing:

library(XML)

getDetails <- function(id) {
html <- htmlTreeParse( "exampleXML.html" ,useInternal = TRUE)
xpathSApply( html , "//div[@class='row']" , function(row) {
print( xmlElementsByTagName(row, "span", recursive = TRUE) )
})
}

getDetails()


Also no luck with:

library(XML) #to install use: install.packages("XML")
library(xml2) #to install use: install.packages("xml2")
library(magrittr) #to install use: install.packages("magrittr")

extract_info <- function(x){
title <- x %>% xml_find_first(".//span[@class='title']") %>% xml_text
property <- x %>% xml_find_first(".//div[@class='column property']") %>% xml_text
setNames(property, title)
}

html <- htmlTreeParse( "exampleXML.html" ,useInternal = TRUE)
html %>% xml_find_all("//div[@class='row']") %>% extract_info



Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "c('HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"

Answer

Consider a nested xpathSApply(): the outer call iterates over the rows, and the inner calls extract each row's title and property relative to that row:

library(XML)

example_html <- paste0('<body>',
                   '  <div class="row">',
                   '    <div class="column">',
                   '       <span class="title">Color</span>',
                   '    </div>',
                   '    <div class="column property">Blue</div>',
                   '  </div>',
                   '  <div class="row">',
                   '    <div class="column">',
                   '       <span class="title">Shape</span>',
                   '    </div>',
                   '    <div class="column property">Square</div>',
                   '  </div>', 
                   '</body>')

doc <- htmlTreeParse(example_html, asText = TRUE, useInternalNodes = TRUE)   # parse the string, not a file

columns <- xpathSApply(doc, "//div[@class='row']", function(row){
   title <- xpathSApply(row, "div[@class='column']/span", xmlValue)
   property <- xpathSApply(row, "div[@class='column property']", xmlValue)
   setNames(trimws(property), trimws(title))    # trimws() strips leading/trailing whitespace
})

columns
#  Color    Shape 
#  "Blue" "Square" 
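
To get the exact "Color = Blue" lines the question asks for, the names of the vector can be pasted onto its values. A minimal sketch, with columns hard-coded here as a stand-in for the result above so the snippet runs on its own (the name out is only illustrative):

```r
# Stand-in for the named character vector built above
columns <- c(Color = "Blue", Shape = "Square")

# Join each name with its value, one "title = property" string per element
out <- paste(names(columns), columns, sep = " = ")

# Print one pair per line
cat(out, sep = "\n")
```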

Alternatively, if every row is strictly consistent (exactly one title and one property per row, none missing), two document-wide xpathSApply() calls are enough:

title <- xpathSApply(doc, "//div[@class='column']/span", xmlValue)
property <- xpathSApply(doc, "//div[@class='column property']", xmlValue)

columns <- setNames(property, title)
columns
#   Color    Shape 
#  "Blue" "Square" 
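
As for the second attempt in the question: the error arises because htmlTreeParse() belongs to the XML package and returns an HTMLInternalDocument, which the xml2 functions (xml_find_all(), xml_find_first(), xml_text()) do not accept. A possible xml2-only sketch follows, assuming the document is parsed with read_html() instead. The third row without a property is hypothetical, added only to illustrate how a missing element is handled:

```r
library(xml2)

# Same markup as above, plus a hypothetical third row that has no property
# (not part of the original example; included only to show missing values).
example_html <- paste0(
  '<body>',
  '<div class="row"><div class="column"><span class="title">Color</span></div>',
  '<div class="column property">Blue</div></div>',
  '<div class="row"><div class="column"><span class="title">Shape</span></div>',
  '<div class="column property">Square</div></div>',
  '<div class="row"><div class="column"><span class="title">Size</span></div></div>',
  '</body>')

doc  <- read_html(example_html)                    # xml2 parser, not htmlTreeParse()
rows <- xml_find_all(doc, "//div[@class='row']")   # one node per row

# xml_find_first() yields exactly one result per row (a missing node if there
# is no match), so titles and properties stay aligned row by row.
title    <- xml_text(xml_find_first(rows, ".//span[@class='title']"), trim = TRUE)
property <- xml_text(xml_find_first(rows, ".//div[@class='column property']"), trim = TRUE)

columns <- setNames(property, title)
columns
```

Because the lookup is done per row, a row lacking a property should come back as NA rather than shifting later values onto the wrong title, which is the risk with the two document-wide calls above.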