ScrapeGoat - 3 months ago

Question

Scraping HTML from vector of strings in R

Building on an answer to a previous question of mine, I'm scraping this website for links with the RSelenium package, using the following code:

library(RSelenium)

startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444,
                      browserName = "chrome")

remDr$open(silent = TRUE)
remDr$navigate("http://karakterstatistik.stads.ku.dk/")
Sys.sleep(2)

# submit the search form
webElem <- remDr$findElement("name", "submit")
webElem$clickElement()
Sys.sleep(5)

# save the page source of the first 100 result pages
html_source <- vector("list", 100)
i <- 1
while (i <= 100) {
  html_source[[i]] <- remDr$getPageSource()
  webElem <- remDr$findElement("id", "next")
  webElem$clickElement()
  Sys.sleep(2)
  i <- i + 1
}
Sys.sleep(3)
remDr$close()


When I then try to scrape the resulting vector of strings (html_source) using the rvest package, I get an error, as what I pass to read_html() is not an HTML document:

kar.links = html_source %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes("#searchResults a") %>%
  html_attr("href")


I've tried collapsing the vector and looked for a string-to-HTML converter, but nothing seems to work.
I suspect the solution lies in how I save the page sources in the loop.

Answer

html_source is a nested list:

str(head(html_source, 3))
# List of 3
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__

In your case, html_source is made up of 100 elements; each element is itself a list with one element, a string holding the raw HTML code. Therefore, to get each raw HTML page, you need to access html_source[[1]][[1]], html_source[[2]][[1]], and so on.
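A quick check (assuming the scrape above has already run) makes the extra level of nesting visible:

class(html_source[[1]])       # "list"      -- one wrapper list per page
class(html_source[[1]][[1]])  # "character" -- the raw HTML string itself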

To flatten html_source, you can do lapply(html_source, `[[`, 1). You get the same result if you use remDr$getPageSource()[[1]] inside the while loop, since getPageSource() returns a one-element list:

str(head(html_source, 3))
# List of 3
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
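From here, the rvest pipeline from the question just needs to run on each page rather than on the list as a whole. A minimal sketch, assuming the #searchResults a selector from the question matches the links you're after:

library(rvest)

# flatten: one raw HTML string per results page
pages <- lapply(html_source, `[[`, 1)

# parse each page and collect the link targets
kar.links <- unlist(lapply(pages, function(src) {
  read_html(src, encoding = "UTF-8") %>%
    html_nodes("#searchResults a") %>%
    html_attr("href")
}))

read_html() accepts a single string of HTML, which is why it has to be applied per element; passing it the whole list is what triggered the original error.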