JSkjold - 3 months ago
R Question

R: Trouble accessing/extracting values from XML

For the last 4 hours I have been trying to access some values in an XML file from R without any luck, and I am now out of ideas.

This is the xml file: http://opcom.ro/order_book/OBK-30XROOPCOM-----C-2016-08-19.xml

I'm trying to get all Qty values in //OrderTimeSeries//SupplyCurve//Period//Interval//Point, so that the first entry would be 0, the next 0 and then 23 and so on.

I have tried stuff like:

library(XML)

doc <- xmlParse("http://opcom.ro/order_book/OBK-30XROOPCOM-----C-2016-08-19.xml")

qty <- unlist(xpathApply(doc, "//OrderTimeSeries//SupplyCurve//Period//Interval//Point", xmlValue))


I think this would work if the XML were written like

<Qty>"0.00000000000"</Qty>


But I don't know how to extract the value when it's written inside the <> as an attribute, like v="0.00000000000".

Answer

There's a <Pos v="##"/> node in each of the <Interval> "records". I suspect that's important data for telling the time series apart, and you're kinda just throwing it away with that rough selector.
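The values in this file sit in attributes rather than in element text, which is why pulling the node value comes back empty. Here is a minimal sketch on a hand-written fragment. The fragment is an assumption about the file's shape, inferred from the element names used in this thread; it is not copied from the real order book, and the numbers in it are made up.

```r
library(xml2)

# Hypothetical fragment: each value lives in the "v" attribute of its element.
snippet <- read_xml(paste0(
  '<Interval>',
  '<Pos v="1"/>',
  '<Point><SeqNr v="1"/><Qty v="0.00000000000"/><PriceAmount v="-500.00"/></Point>',
  '</Interval>'
))

qty_node <- xml_find_first(snippet, ".//Qty")

# The element has no text content, so this is an empty string.
qty_text <- xml_text(qty_node)

# The number has to be read from the attribute instead.
qty_value <- as.numeric(xml_attr(qty_node, "v"))
```

The same xml_attr(node, "v") call is what the functions further down rely on.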

You can attack the problem in a straightforward way:

  • find each interval
    • extract Pos
    • find Point
      • extract the sub-values of Point

and build a data.frame along the way:

library(xml2)
library(purrr)

doc <- read_xml("http://opcom.ro/order_book/OBK-30XROOPCOM-----C-2016-08-19.xml")

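# Turn one <Point> into a one-row data frame: one column per child element,
# holding that child's numeric "v" attribute.
names_and_values <- function(x) {
  kids <- xml_find_all(x, ".//*")
  vals <- as.numeric(xml_attr(kids, "v"))
  df <- rbind.data.frame(vals)
  setNames(df, xml_name(kids))
}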

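# Turn one <Interval> into a data frame: one row per <Point>, tagged with
# the interval's Pos value.
pos_and_points <- function(x) {
  pos <- as.numeric(xml_attr(xml_find_first(x, ".//Pos"), "v"))
  xml_find_all(x, ".//Point") %>%
    map_df(names_and_values) -> df
  df$pos <- pos
  df
}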

xml_find_all(doc, ".//OrderTimeSeries/SupplyCurve/Period/Interval") %>% 
  map_df(pos_and_points) -> df

dplyr::glimpse(df)
## Observations: 5,316
## Variables: 4
## $ Qty         <dbl> 0.0, 0.0, 23.0, 23.0, 26.5, 26.5, 56.8, 56.8, 150.5, 150.5, 171.5, 171....
## $ PriceAmount <dbl> -5.000000e+02, -2.500000e+01, -2.500000e+01, -2.235536e+01, -2.235536e+...
## $ SeqNr       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...
## $ pos         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
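
If Pos is not needed and all that's wanted is the flat vector of Qty values from the question, a shorter route is to select the Qty elements directly and read their v attribute. A minimal sketch, run on a made-up fragment rather than the real file (the nesting is assumed from the XPath in the question, and the values are invented for illustration):

```r
library(xml2)

# Hypothetical stand-in for the order book, reduced to three points.
sample_doc <- read_xml(paste0(
  '<OrderTimeSeries><SupplyCurve><Period><Interval>',
  '<Pos v="1"/>',
  '<Point><SeqNr v="1"/><Qty v="0.0"/><PriceAmount v="-500"/></Point>',
  '<Point><SeqNr v="2"/><Qty v="0.0"/><PriceAmount v="-25"/></Point>',
  '<Point><SeqNr v="3"/><Qty v="23.0"/><PriceAmount v="-25"/></Point>',
  '</Interval></Period></SupplyCurve></OrderTimeSeries>'
))

# Select every Qty under a supply-curve Point, then read its "v" attribute.
qty_nodes <- xml_find_all(sample_doc, "//SupplyCurve//Point/Qty")
qty <- as.numeric(xml_attr(qty_nodes, "v"))
qty
```

With the XML package from the question, the equivalent should be along the lines of xpathSApply(doc, "//SupplyCurve//Point/Qty", xmlGetAttr, "v"), which returns character values that still need as.numeric().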