AI52487963 - 3 months ago
R Question

Parsing XML with no style information?

I'm trying to parse a simple XML document from the web, but I keep hitting roadblocks. If I try a classic XML parse:

library(XML)
url <- c("http://www.boardgamegeek.com/xmlapi/boardgame/173346?stats=1")
xml <- xmlTreeParse(url, encoding = "UTF-8", isURL=TRUE)


I get:

Unknown encoding "UTF-8"
Error: 1: Unknown encoding "UTF-8"


This happens even though I specified the encoding explicitly. When I open the URL in a browser, a notice at the top says the XML file has no style information associated with it, but the document tree is displayed anyway. If I then try htmlTreeParse instead:

file <- htmlTreeParse(url, encoding = "UTF-8", isURL=TRUE)


I get:

Error in which(value == defs) :
argument "code" is missing, with no default


Is there something obvious I'm missing here?

Answer

You may find it easier in the long run to move to rvest and xml2:

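library(xml2)   # read_xml(), xml_find_all(), xml_text()
library(rvest)  # re-exports the %>% pipe

# Fetch and parse the XML straight from the URL
pg <- read_xml("http://www.boardgamegeek.com/xmlapi/boardgame/173346?stats=1")

# All names, the description, and the honors, each as a character vector
xml_find_all(pg, "//name") %>% xml_text()

xml_find_all(pg, "//description") %>% xml_text()

xml_find_all(pg, "//boardgamehonor") %>% xml_text()

# Only the primary name
xml_find_all(pg, "//name[@primary='true' and @sortindex=1]") %>% xml_text()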
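If you want to try the XPath queries without hitting the network, a minimal sketch along these lines may help. The XML fragment is made up: it only assumes the element and attribute names used in the queries above (name, description, boardgamehonor, primary, sortindex), plus a hypothetical objectid attribute added for illustration. The real response is larger and may be structured differently.

```r
library(xml2)

# Made-up fragment that imitates the shape assumed by the XPath queries;
# all values and the objectid attributes are placeholders.
doc <- read_xml('<boardgames>
  <boardgame objectid="1">
    <name primary="true" sortindex="1">Example Game</name>
    <name sortindex="1">Beispielspiel</name>
    <description>A made-up description.</description>
    <boardgamehonor objectid="10">Made-up Award Nominee</boardgamehonor>
    <boardgamehonor objectid="11">Another Made-up Award</boardgamehonor>
  </boardgame>
</boardgames>')

# Every <name> element, as a character vector
all_names <- xml_text(xml_find_all(doc, "//name"))

# Only the name flagged as primary
primary_name <- xml_text(
  xml_find_first(doc, "//name[@primary='true' and @sortindex=1]")
)

# Collect repeated elements into a data frame: one row per <boardgamehonor>
honor_nodes <- xml_find_all(doc, "//boardgamehonor")
honors <- data.frame(
  objectid = xml_attr(honor_nodes, "objectid"),
  honor    = xml_text(honor_nodes),
  stringsAsFactors = FALSE
)

print(all_names)
print(primary_name)
print(honors)
```

xml_find_all() returns a node set, so xml_text() and xml_attr() are vectorised over it. That makes it easy to build a data frame from repeated elements.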