Sasha Sasha - 1 month ago 6
R Question

Extracting data from XML in R

I need to extract certain data from XML that looks like this (simplified for brevity):

<Doc name="Doc1">
<Lists Count="1">
<List Name="List1">
<Points Count="3">
<Point Id="1">
<Tags Count ="1">"a"</Tags>
<Point Position="1" />
</Point>
<Point Id="2">
<Point Position="2" />
</Point>
<Point Id="3">
<Tags Count="1">"c"</Tags>
<Point Position="3" />
</Point>
</Points>
</List>
</Lists>
</Doc>


The output should be a data frame that pairs each point Id with its tag and position:

Point Tag Position
1 1 a 1
2 2 <NA> 2
3 3 c 3


I am new to XML and have been experimenting with the xml2 package. So far I can extract each variable separately, but because some points have no Tags element, the resulting vectors have different lengths and I can't find a way to match the three values to each point.

> library(xml2)
> library(magrittr)  # provides %>%
> xml_data<-read_xml(...)
> xml_data %>% xml_find_all("//Points/Point") %>% xml_attr("Id")
[1] "1" "2" "3"
> xml_data %>% xml_find_all("//Point/Point") %>% xml_attr("Position")
[1] "1" "2" "3"
> xml_data %>% xml_find_all("//Tags") %>% xml_text()
[1] "\"a\"" "\"c\""

Answer

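purrr and xml2 go well together. Iterate over the outer Point nodes and pull all three values from each node; xml_find_first() returns a missing node when a point has no Tags, so that row gets NA instead of shifting the other values: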

library(xml2)
library(purrr)

txt <- '<Doc name="Doc1">
    <Lists Count="1">
        <List Name="List1">
            <Points Count="3">
                <Point Id="1">
                    <Tags Count ="1">"a"</Tags>
                    <Point Position="1"  /> 
                </Point>
                <Point Id="2">
                    <Point Position="2"  /> 
                </Point>
                <Point Id="3">
                    <Tags Count="1">"c"</Tags>
                    <Point Position="3"  /> 
                </Point>
            </Points>
        </List>
    </Lists>
</Doc>'

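doc <- read_xml(txt)
# Select only the outer <Point> nodes (direct children of <Points>)
xml_find_all(doc, ".//Points/Point") %>% 
  map_df(function(x) {
    # One row per outer <Point>; a missing child yields NA
    list(
      Point=xml_attr(x, "Id"),
      # Strip the literal double quotes around the tag text
      Tag=xml_find_first(x, ".//Tags") %>%  xml_text() %>%  gsub('^"|"$', "", .),
      # The nested <Point> carries the Position attribute
      Position=xml_find_first(x, ".//Point") %>% xml_attr("Position")
    )
  })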
## # A tibble: 3 × 3
##   Point   Tag Position
##   <chr> <chr>    <chr>
## 1     1     a        1
## 2     2  <NA>        2
## 3     3     c        3
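
The same per-node idea can also be written without purrr. The following is only a sketch, assuming the sample document above: it relies on xml_find_first() returning one result per input node when given a node set, so the three vectors stay the same length. The names points, tags, positions and result are illustrative, and the conversion of Id and Position to integers is an extra step that was not requested in the question.

```r
library(xml2)

# Same sample document as above, condensed.
txt <- '<Doc name="Doc1"><Lists Count="1"><List Name="List1"><Points Count="3">
  <Point Id="1"><Tags Count="1">"a"</Tags><Point Position="1"/></Point>
  <Point Id="2"><Point Position="2"/></Point>
  <Point Id="3"><Tags Count="1">"c"</Tags><Point Position="3"/></Point>
</Points></List></Lists></Doc>'

doc <- read_xml(txt)

# Outer <Point> nodes only: direct children of <Points>.
points <- xml_find_all(doc, ".//Points/Point")

# xml_find_first() on a node set gives one result per input node,
# with a missing node where nothing matched, so lengths stay aligned.
tags      <- xml_text(xml_find_first(points, "./Tags"))
positions <- xml_attr(xml_find_first(points, "./Point"), "Position")

result <- data.frame(
  Point    = as.integer(xml_attr(points, "Id")),
  Tag      = gsub('^"|"$', "", tags),  # strip the literal quotes; NA stays NA
  Position = as.integer(positions),
  stringsAsFactors = FALSE
)
print(result)
```

Using ./Tags and ./Point restricts the search to direct children of each outer Point, which avoids picking up more deeply nested elements if the real document has them.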