Texas_Crane Texas_Crane - 17 days ago 9
R Question

R XML to DataFrame subsetting

I have a bunch of chat transcripts in XML format. I've included a sample record below. I don't need the entire record, just 3 things. The first is the attribute @realTimeID. The second/third items are the values from varValue when source == "PostChat". This will include a numeric value from 1-10. There may be a second value that includes a text entry. Only a small number of records will contain these "PostChat" values.

What I want is a data frame with a column for realTimeID and then 2 columns for the numeric and possible text value. If I could get a dataframe with a column for realTimeID and a second column for value, I could work with the data from there just fine.

Here's some sample data:

<Report account="12345" end_time="2016-07-01T00:00:59+00:00" limit="10000" more_sessions="true" start_time="2016-06-11T00:00:00+00:00" user="smith">
<Session id="ID1536678170" realTimeID="4768543970">
<Visitor id="1131902386012684">
<ip>123.456.789</ip>
<agent>Chrome 51.0.2704.63</agent>
<host/>
<chatReferer> foo </chatReferer>
<GeoInfo>
<geoCity/>
<geoConType/>
<geoCountry>USA</geoCountry>
<geoIP>123.456.789</geoIP>
<geoISP>USA ISP</geoISP>
<geoOrg>NA</geoOrg>
<geoPost/>
<geoReg/>
</GeoInfo>
</Visitor>
<Chat end_time="2016-06-11T21:46:14+00:00" start_time="2016-06-11T21:25:59+00:00">
<line by="info" time="2016-06-11T21:25:59+00:00">
<Text>Please do not post credit card or other sensitive data in this window. </Text>
</line>
<line by="info" time="2016-06-11T21:26:03+00:00">
<Text>You are now chatting with John.</Text>
</line>
<line by="John" repId="ID2447" time="2016-06-11T21:28:04+00:00">
<HTML>&lt;span dir="ltr"&gt;Hi sir&lt;/span&gt;</HTML>
</line>
<line by="John" repId="ID2447" time="2016-06-11T21:28:15+00:00">
<HTML>&lt;span dir="ltr"&gt;How may i help you ?&lt;/span&gt;</HTML>
</line>
<line by="you" time="2016-06-11T21:28:16+00:00">
<Text>Hi John. Im Bob. I have a technical question.</Text>
</line>
</Chat>
<VarValues>
<varValue id="ID917165" name="DisconnectedBy" source="Internal" sourceName="null" time="2016-06-11T21:46:14+00:00">RepStoppedChat</varValue>
<varValue id="ID922205" name="language" source="MonitorTag" sourceName="null" time="2016-06-11T21:23:46+00:00">English</varValue>
<varValue id="ID1317606" name="pageLoadTime" source="MonitorTag" sourceName="null" time="2016-06-11T21:23:46+00:00">88 sec</varValue>
<varValue id="ID1323660" name="survey90990357" source="Operator" sourceName="null" time="2016-06-11T21:32:38+00:00">Incomplete (INC) - customer abandons</varValue>
<varValue id="ID1372749" name="LP_Visitor_Category" source="Internal" sourceName="null" time="2016-06-11T21:23:43+00:00">0</varValue>
<varValue id="ID1617100" name="live_engage_control_group" source="Internal" sourceName="null" time="2016-06-11T21:23:43+00:00">false</varValue>
<varValue id="ID3647561" name="rerouteFlag" source="Rule Engine" sourceName="null" time="2016-06-11T21:23:46+00:00">true</varValue>
<varValue id="ID3665417" name="operatorName" source="Rule Engine" sourceName="null" time="2016-06-11T21:26:03+00:00">John Doe</varValue>
<varValue id="ID3730453" name="RenameFlag" source="Rule Engine" sourceName="null" time="2016-06-11T21:23:46+00:00">true</varValue>
<varValue id="ID3742796" name="PT-EligibleInSession" source="Rule Engine" sourceName="null" time="2016-06-11T21:24:43+00:00">Yes</varValue>
<varValue id="ID3834774" name="survey88140234" source="PostChat" sourceName="null" time="2016-06-13T04:44:54+00:00">10</varValue>
<varValue id="ID3834774" name="survey88140234" source="PostChat" sourceName="null" time="2016-06-13T04:44:54+00:00">Great Experience. Thanks for the help!</varValue>
</VarValues>
<Reps>
<Rep endTime="2016-06-11T21:46:14+00:00" id="ID2447" order="1" repName="John Doe" startTime="2016-06-11T21:26:03+00:00">John</Rep>
</Reps>
</Session>
<Session id="ID1536678170" realTimeID="123456789">
<Visitor id="1131902386012684">
<ip>123.456.789</ip>
<agent>Chrome 51.0.2704.63</agent>
<host/>
<chatReferer> foo </chatReferer>
<GeoInfo>
<geoCity/>
<geoConType/>
<geoCountry>USA</geoCountry>
<geoIP>123.456.789</geoIP>
<geoISP>USA ISP</geoISP>
<geoOrg>NA</geoOrg>
<geoPost/>
<geoReg/>
</GeoInfo>
</Visitor>
<Chat end_time="2016-06-11T21:46:14+00:00" start_time="2016-06-11T21:25:59+00:00">
<line by="info" time="2016-06-11T21:25:59+00:00">
<Text>Please do not post credit card or other sensitive data in this window. </Text>
</line>
<line by="info" time="2016-06-11T21:26:03+00:00">
<Text>You are now chatting with John.</Text>
</line>
<line by="John" repId="ID2447" time="2016-06-11T21:28:04+00:00">
<HTML>&lt;span dir="ltr"&gt;Hi sir&lt;/span&gt;</HTML>
</line>
<line by="John" repId="ID2447" time="2016-06-11T21:28:15+00:00">
<HTML>&lt;span dir="ltr"&gt;How may i help you ?&lt;/span&gt;</HTML>
</line>
<line by="you" time="2016-06-11T21:28:16+00:00">
<Text>Hi John. Im Bob. I have a technical question.</Text>
</line>
</Chat>
<VarValues>
<varValue id="ID917165" name="DisconnectedBy" source="Internal" sourceName="null" time="2016-06-11T21:46:14+00:00">RepStoppedChat</varValue>
<varValue id="ID922205" name="language" source="MonitorTag" sourceName="null" time="2016-06-11T21:23:46+00:00">English</varValue>
<varValue id="ID1317606" name="pageLoadTime" source="MonitorTag" sourceName="null" time="2016-06-11T21:23:46+00:00">88 sec</varValue>
<varValue id="ID1323660" name="survey90990357" source="Operator" sourceName="null" time="2016-06-11T21:32:38+00:00">Incomplete (INC) - customer abandons</varValue>
<varValue id="ID1372749" name="LP_Visitor_Category" source="Internal" sourceName="null" time="2016-06-11T21:23:43+00:00">0</varValue>
<varValue id="ID1617100" name="live_engage_control_group" source="Internal" sourceName="null" time="2016-06-11T21:23:43+00:00">false</varValue>
<varValue id="ID3647561" name="rerouteFlag" source="Rule Engine" sourceName="null" time="2016-06-11T21:23:46+00:00">true</varValue>
<varValue id="ID3665417" name="operatorName" source="Rule Engine" sourceName="null" time="2016-06-11T21:26:03+00:00">John Doe</varValue>
<varValue id="ID3730453" name="RenameFlag" source="Rule Engine" sourceName="null" time="2016-06-11T21:23:46+00:00">true</varValue>
<varValue id="ID3742796" name="PT-EligibleInSession" source="Rule Engine" sourceName="null" time="2016-06-11T21:24:43+00:00">Yes</varValue>
</VarValues>
<Reps>
<Rep endTime="2016-06-11T21:46:14+00:00" id="ID2447" order="1" repName="John Doe" startTime="2016-06-11T21:26:03+00:00">John</Rep>
</Reps>
</Session>
</Report>


I can read in the data:

library(XML)

dat <- xmlInternalTreeParse("data/sessions6.xml", useInternalNodes = T)


I can extract the realTimeID value using:

foo <- xpathApply(datRoot, "//Session", xmlGetAttr, "realTimeID")


and the varValues I need using:

tmp <- xpathApply(datRoot, "//varValue[@source='PostChat']", xmlValue)


But I don't know how to connect the two and grab the realTimeID values that are associated with the PostChat varValues. Alternatively, I thought about creating a dataframe with realTimeID and all of the VarValues. I obviously have the list of all the IDs but wasn't sure how I could just extract a dataframe of VarValues. Any help would be appreciated.

EDIT: I've updated my XML code sample to make it more complete and include one session with PostChat values and one without. Thanks!

Answer

For the updated xml-file you can to this:

require(xml2)
require(tidyverse)
doc <- read_xml(path_to_file)
sessions <- doc %>% xml_find_all("Session")
realTimeID <- sessions %>% map(xml_attr, "realTimeID")
varValue <- sessions %>% 
  map(xml_find_all, ".//varValue[@source='PostChat']") %>% 
  map(xml_text)
map2_df(realTimeID, varValue, ~tibble(realTimeID = .x, value = .y[1], text = .y[2]))

Which gives you:

# A tibble: 2 × 3
  realTimeID value                                    text
       <chr> <chr>                                   <chr>
1 4768543970    10 Great Experience.  Thanks for the help!
2  123456789  <NA>                                    <NA>

You can use type_convert to change the column typed