MichaelChirico - 2 months ago
R Question

Name is not XML Namespace compliant

I'm trying to read the table on this site:

http://spacefem.com/pregnant/due.php?use=EDD&m=09&d=10&y=16

I use rvest, but quickly get an error:

library(rvest)
read_html("http://spacefem.com/pregnant/due.php?use=EDD&m=09&d=10&y=16")



Error: Name spoiler:3tbt4d3m is not XML Namespace compliant [202]


What does this error mean, and is there anything I can do to get around it?

I've gotten as far as pinpointing the internal function causing the error: xml2:::doc_parse_raw. However, xml2:::doc_parse_raw is simply a call to internal C code, making debugging of this issue substantially more difficult.

Answer

Another option is to "clean" the document with htmltidy before parsing it. This needs v0.3.0 or higher, which (as of the date of this answer) means installing the development version from GitHub rather than the CRAN version, until CRAN is up to 0.3.0+:

library(rvest)
library(htmltidy) # devtools::install_github("hrbrmstr/htmltidy")
library(httr)

URL <- "http://spacefem.com/pregnant/due.php?use=EDD&m=09&d=10&y=16"

# the site was not returning content for me w/o a more browser-like user agent

res <- GET(URL, user_agent("Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36"))

cleaned <- tidy_html(content(res, as="text", encoding="UTF-8"),
                     list(TidyDocType="html5"))

pg <- read_html(cleaned)
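
Once the page parses, the table the question is after can probably be pulled out with rvest's html_nodes() and html_table(). The sketch below is only an illustration: it runs on a small inline document (pg_demo, with made-up column names and values) rather than the live page, so it works offline. The real page's table layout, and whether the first table is the one you want, are assumptions you'd need to check.

```r
library(rvest)

# Hypothetical stand-in for the cleaned page: one small table.
# The column names and values are invented for illustration only.
html <- "<html><body><table>
<tr><th>week</th><th>pct</th></tr>
<tr><td>39</td><td>10</td></tr>
<tr><td>40</td><td>20</td></tr>
</table></body></html>"

pg_demo <- read_html(html)

# html_table() on a node set returns a list with one data frame per <table>
tables <- html_table(html_nodes(pg_demo, "table"), header = TRUE)

# Assumption: the table of interest is the first one on the page
first_table <- tables[[1]]
```

With the real document you would pass pg instead of pg_demo, then inspect length(tables) to see which element holds the table you need.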