
Scraping of broken html website

I am completely new to coding in R and have been flung into web scraping.
I'm interested in scraping a list of 1012 links that all look similar to this example: http://karakterstatistik.stads.ku.dk/Histogram/ASOB05038E/Summer-2015

All of the elements are located in the same places on each of the links. However, my scraping attempts do not work. I have also tried using XPath, but it is of no use:

library(rvest)

link <- "http://karakterstatistik.stads.ku.dk/Histogram/ASOB05038E/Summer-2015"

link %>%
  read_html() %>%
  html_nodes(xpath = "//*[@id='karsumForm']/table/tbody/tr[8]/td[2]")

Instead of the table cell I'm after, all I get back is an empty node set:

{xml_nodeset (0)}


The HTML isn't broken, and you have to be careful when you come up with an XPath from the "Inspect Element" view, since most browsers normalize the HTML as they read it in. Firefox, Chrome (et al.) may show you a nice table > tbody > tr > ... structure, but the tbody tag may not actually be present in the page source.
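You can see this mismatch without the live site. A minimal sketch (the inline HTML below is an illustration, not the real karakterstatistik markup): R's parser, unlike a browser's DOM view, leaves the markup as written, so a tbody-laden XPath finds nothing while the tbody-free one works.

```r
library(rvest)  # re-exports the %>% pipe

# A minimal table written the way many pages actually ship it: no <tbody>.
pg <- read_html('<form id="karsumForm"><table>
  <tr><td>Antal tilmeldte</td><td>115</td></tr>
</table></form>')

# The browser-derived XPath (with tbody) matches nothing in the raw source...
html_nodes(pg, xpath = "//table/tbody/tr")
## {xml_nodeset (0)}

# ...while the tbody-free version finds the cell.
html_nodes(pg, xpath = "//table/tr/td[2]") %>% html_text()
## [1] "115"
```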


library(rvest)

URL <- "http://karakterstatistik.stads.ku.dk/Histogram/ASOB05038E/Summer-2015"

pg <- read_html(URL)

html_nodes(pg, xpath=".//form[@id='karsumForm']/table/tr[8]/td[2]") %>%
  html_text() %>%
  trimws()
## [1] "115"

You can use view-source in most browsers to see the unadulterated HTML source or devtools::install_github("hrbrmstr/xmlview") and do xmlview::xml_view(pg) on the pg in the code snippet above to see the raw HTML from the site (there's a mode in my xmlview package that lets you test out XPath filters, too).
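If you'd rather not install another package, xml2 (rvest's parsing backend) ships xml_structure(), which prints the tree exactly as R parsed it, so you can confirm for yourself whether a tbody is really there. A small sketch on inline HTML:

```r
library(rvest)
library(xml2)

# Inline stand-in for the real page; xml_structure() prints the parsed tree.
pg <- read_html('<table><tr><td>ECTS</td><td>15</td></tr></table>')
xml_structure(pg)
```

The printed tree shows table > tr > td with no tbody in between, matching what view-source would show you.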

If there are non-duplicated "named fields" then you can do something like:

get_val <- function(x, label) {
  xpath <- sprintf(".//table/tr/td[contains(., '%s')][1]/following-sibling::td", label)
  html_nodes(x, xpath=xpath) %>% html_text() %>% trimws()
}

get_val(pg, "Fakultet")
## [1] "Det Samfundsvidenskabelige Fakultet"

get_val(pg, "Institut")
## [1] "Sociologisk Institut"

get_val(pg, "Termin")
## [1] "s15"

get_val(pg, "ECTS")
## [1] "15"

get_val(pg, "Andre versioner") %>% gsub("[[:space:]]+", ", ", .)
## [1] "s16, v15, s14, s13, s12, s11"

You can somewhat deal with dups:

get_val(pg, "Antal tilmeldte")
## [1] "115"             ""                "Antal tilmeldte" "11"       

but it may not be perfect.
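One blunt way to cope is a variant of get_val that keeps only the first non-empty hit (an assumption about which duplicate you actually want). A sketch, demonstrated on inline HTML with a repeated label:

```r
library(rvest)

# Same XPath as get_val above; the only change is filtering the results.
get_val_first <- function(x, label) {
  xpath <- sprintf(".//table/tr/td[contains(., '%s')][1]/following-sibling::td", label)
  vals <- html_nodes(x, xpath = xpath) %>% html_text() %>% trimws()
  vals[nzchar(vals)][1]
}

# Inline HTML mimicking a page where the same field name appears twice.
pg2 <- read_html('<table>
  <tr><td>Antal tilmeldte</td><td>115</td></tr>
  <tr><td>Antal tilmeldte</td><td>11</td></tr>
</table>')

get_val_first(pg2, "Antal tilmeldte")
## [1] "115"
```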

You can get far more targeted if you hone your XPath skills (I won't be posting any more for this answer).
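Since the question mentions 1012 similar links, the get_val() helper can be mapped over all the pages to build one data frame. A sketch of the pattern: inline snippets stand in for the real pages here so it runs offline; in real use you'd build pages with something like lapply(urls, function(u) { Sys.sleep(1); read_html(u) }) to be polite to the server.

```r
library(rvest)

# Helper from the answer above, reproduced so this sketch is self-contained.
get_val <- function(x, label) {
  xpath <- sprintf(".//table/tr/td[contains(., '%s')][1]/following-sibling::td", label)
  html_nodes(x, xpath = xpath) %>% html_text() %>% trimws()
}

# Stand-ins for read_html() on each of the real URLs.
pages <- list(
  read_html('<table><tr><td>ECTS</td><td>15</td></tr><tr><td>Termin</td><td>s15</td></tr></table>'),
  read_html('<table><tr><td>ECTS</td><td>7.5</td></tr><tr><td>Termin</td><td>v15</td></tr></table>')
)

# One row per page, one column per named field.
results <- do.call(rbind, lapply(pages, function(pg) {
  data.frame(ects   = get_val(pg, "ECTS"),
             termin = get_val(pg, "Termin"),
             stringsAsFactors = FALSE)
}))

results
##   ects termin
## 1   15    s15
## 2  7.5    v15
```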