Variax Variax - 11 days ago 6
R Question

Transforming data from xml into R dataframe

I'm trying to convert an xml file to a dataframe, but the format seems to be off. I've looked at different tutorials and, while I've been moderately succesful at getting the information I need using a for loop and navigating the parsed file, I've been told that this solution is not very efficient.

I tried this code then:

require(XML)
parsed<-xmlParse("SEWL.xml")
xmlToDataFrame(parsed)


But it gives an error: Error in
[<-.data.frame
(
*tmp*
, i, names(nodes[[i]]), value = c("\"LL18179\"\"2016/08\"0.32485.43896.59801.2131\"OK\"", :
duplicate subscripts for columns

This other code works, but the formatting is not what I need:

require(XML)
require(plyr)
pldf<-ldply(xmlToList("SEWL.xml"),data.frame)


The resulting dataframe is as follows:

.id X..i.. text .attrs test.code test.validuntil test.meas.text test.meas..attrs test.meas.text.1
1 technician "John" <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 location "CO" <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 temp <NA> 21.3 celsius <NA> <NA> <NA> <NA> <NA>
4 runtype "routine" <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 sample <NA> <NA> 2323 "LL18179" "2016/08" 0.3248 baseline 5.4389
6 sample <NA> <NA> 2323 "LL18179" "2016/08" 0.3248 baseline 5.4389
7 sample <NA> <NA> 8979237 "AA09453" "2016/03" 0.0117 baseline 5.6012
8 sample <NA> <NA> 8979237 "AA09453" "2016/03" 0.0117 baseline 5.6012
9 .attrs 2015_07_31_11_33_22 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 .attrs 20150731 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
11 .attrs 113322 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
test.meas..attrs.1 test.meas.text.2 test.meas..attrs.2 test.calc test.result test..attrs test.code.1 test.validuntil.1
1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 std 6.5980 data 1.2131 "OK" laslum "ATR150607" "2017/05"
6 std 6.5980 data 1.2131 "OK" 3 "ATR150607" "2017/05"
7 std 1.1431 data 0.2041 "FAIL" absat <NA> <NA>
8 std 1.1431 data 0.2041 "FAIL" 2 <NA> <NA>
9 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
11 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
test.meas.text.3 test.meas..attrs.3 test.meas.text.4 test.meas..attrs.4 test.meas.text.5 test.meas..attrs.5
1 <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA> <NA> <NA> <NA> <NA> <NA>
5 0.0673 baseline 4.9721 std 10.3851 data
6 0.0673 baseline 4.9721 std 10.3851 data
7 <NA> <NA> <NA> <NA> <NA> <NA>
8 <NA> <NA> <NA> <NA> <NA> <NA>
9 <NA> <NA> <NA> <NA> <NA> <NA>
10 <NA> <NA> <NA> <NA> <NA> <NA>
11 <NA> <NA> <NA> <NA> <NA> <NA>
test.calc.1 test.result.1 test..attrs.1
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 2.0886 "Warning" atr
6 2.0886 "Warning" 1
7 <NA> <NA> <NA>
8 <NA> <NA> <NA>
9 <NA> <NA> <NA>
10 <NA> <NA> <NA>
11 <NA> <NA> <NA>


This is the example XML file that I'm using:

<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
<technician>"John"</technician>
<location>"CO"</location>
<temp scale="celsius">21.3</temp>
<runtype>"routine"</runtype>
<sample id="2323">
<test name="laslum" order="3">
<code>"LL18179"</code>
<validuntil>"2016/08"</validuntil>
<meas name="baseline">0.3248</meas>
<meas name="std">5.4389</meas>
<meas name="data">6.5980</meas>
<calc>1.2131</calc>
<result>"OK"</result>
</test>
<test name="atr" order="1">
<code>"ATR150607"</code>
<validuntil>"2017/05"</validuntil>
<meas name="baseline">0.0673</meas>
<meas name="std">4.9721</meas>
<meas name="data">10.3851</meas>
<calc>2.0886</calc>
<result>"Warning"</result>
</test>
</sample>
<sample id="8979237">
<test name="absat" order="2">
<code>"AA09453"</code>
<validuntil>"2016/03"</validuntil>
<meas name="baseline">0.0117</meas>
<meas name="std">5.6012</meas>
<meas name="data">1.1431</meas>
<calc>0.2041</calc>
<result>"FAIL"</result>
</test>
</sample>
</experiment>


And the dataframe that I'm hoping to get:

experiment technician location temp runtype sample test order code validuntil baseline std data calc result date time
1 abc123 John CO 21.3 routine 2323 laslum 3 LL18179 2016/08 0.3248 5.4389 6.5980 1.2131 OK 20150731 113322
2 abc123 John CO 21.3 routine 2323 atr 1 ATR150607 2017/05 0.0673 4.9721 10.3851 2.0886 Warning 20150731 113322
3 abc123 John CO 21.3 routine 8979237 absat 2 AA09453 2016/03 0.0117 5.6012 1.1431 0.2041 FAIL 20150731 113322


I don't need the exact same format, just something close enough so I can transform it into the example.

Answer

Using Lines in the Note at the end:

library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

do.call("rbind", xpathApply(doc, "//experiment", function(x) {
  data.frame(experiment = xmlAttrs(x)[["name"]],
       technician = xmlValue(x[["technician"]]),
       location = xmlValue(x[["location"]]),
       temp = xmlValue(x[["temp"]]),
       runtype = xmlValue(x[["runtype"]]),
       t(do.call(cbind, xpathApply(x, "sample", function(y) {
            sample <- xmlAttrs(y)[["id"]]
            xpathSApply(y, "test", function(z) {
                   c(sample = sample,
                        test = xmlAttrs(z)[["name"]],
                        order = xmlAttrs(z)[["order"]],
                        code = xmlValue(z[["code"]]),
                        validuntil = xmlValue(z[["validuntil"]]),
                        baseline = xmlValue(z["meas"][[1]]),
                        std = xmlValue(z["meas"][[2]]),
                        data = xmlValue(z["meas"][[3]]),
                        calc = xmlValue(z[["calc"]]),
                        result = xmlValue(z[["result"]])
             )})}))),
       date = xmlAttrs(x)[["date"]],
       time = xmlAttrs(x)[["time"]]
)}))

giving:

  experiment technician location temp   runtype  sample   test order
1     abc123     "John"     "CO" 21.3 "routine"    2323 laslum     3
2     abc123     "John"     "CO" 21.3 "routine"    2323    atr     1
3     abc123     "John"     "CO" 21.3 "routine" 8979237  absat     2
         code validuntil baseline    std    data   calc    result     date
1   "LL18179"  "2016/08"   0.3248 5.4389  6.5980 1.2131      "OK" 20150731
2 "ATR150607"  "2017/05"   0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3   "AA09453"  "2016/03"   0.0117 5.6012  1.1431 0.2041    "FAIL" 20150731
    time
1 113322
2 113322
3 113322

Note:

Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
    <technician>"John"</technician>
    <location>"CO"</location>
    <temp scale="celsius">21.3</temp>
    <runtype>"routine"</runtype>
    <sample id="2323">
        <test name="laslum" order="3">
            <code>"LL18179"</code>
            <validuntil>"2016/08"</validuntil>
            <meas name="baseline">0.3248</meas>
            <meas name="std">5.4389</meas>
            <meas name="data">6.5980</meas>
            <calc>1.2131</calc>
            <result>"OK"</result>
        </test>
        <test name="atr" order="1">
            <code>"ATR150607"</code>
            <validuntil>"2017/05"</validuntil>
            <meas name="baseline">0.0673</meas>
            <meas name="std">4.9721</meas>
            <meas name="data">10.3851</meas>
            <calc>2.0886</calc>
            <result>"Warning"</result>
        </test>
    </sample>
    <sample id="8979237">
        <test name="absat" order="2">
            <code>"AA09453"</code>
            <validuntil>"2016/03"</validuntil>
            <meas name="baseline">0.0117</meas>
            <meas name="std">5.6012</meas>
            <meas name="data">1.1431</meas>
            <calc>0.2041</calc>
            <result>"FAIL"</result>
        </test>
    </sample>
</experiment>'
Comments