ScrapeGoat - 1 year ago
R Question

Iterating rvest scrape function gives: "Error in open.connection(x, "rb") : Timeout was reached"

I'm scraping this website using the rvest package. When I iterate my function too many times I get "Error in open.connection(x, "rb") : Timeout was reached". I have searched for similar questions, but the answers seem to lead to dead ends. I suspect the problem is server-side and that the website has a built-in restriction on how many times I can visit the page. How do I investigate this hypothesis?

The code: I have the links to the underlying web pages and want to construct a data frame with the information extracted from the associated web pages. I have simplified my scraping function a bit as the problem is still occurring with a simpler function:

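library(rvest)    # read_html(), html_nodes(), html_text() and the %>% pipe
library(stringr)  # str_split(), str_replace_all()

scrape_test = function(link) {

  # Split the URL on "/" and flatten the result so it can be indexed
  slit <- str_split(link, "/") %>%
    unlist()
  id <- slit[5]
  sem <- slit[6]

  # Download the page and extract the text of its h2 headings
  name <- link %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("h2") %>%
    html_text() %>%
    str_replace_all("\r\n", "")

  return(data.frame(id, sem, name))
}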

I use map_df() from the purrr package to iterate the function:

links %>% map_df(scrape_test)

Now, if I iterate the function over only 50 links I receive no error. But when I increase the number of links I encounter the aforementioned error. Furthermore, I get the following warnings:

  • "In bind_rows_(x, .id) : Unequal factor levels: coercing to character"

  • "closing unused connection 4 (link)"
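
A note on the first warning: it typically appears when the data frames being bound contain factor columns with different levels, which is what data.frame() produces from strings by default in R versions before 4.0. The sketch below is not from the original post; make_row() and the values passed to it are hypothetical stand-ins for the data.frame() call at the end of scrape_test(), and it only illustrates that passing stringsAsFactors = FALSE keeps the columns as plain character vectors.

```r
# Hypothetical stand-in for the data.frame() call at the end of scrape_test();
# the argument values used below are made up for illustration.
make_row <- function(id, sem, name) {
  # stringsAsFactors = FALSE keeps text columns as character vectors
  data.frame(id, sem, name, stringsAsFactors = FALSE)
}

rows <- list(
  make_row("A1", "Winter-2013", "Course one"),
  make_row("B2", "Summer-2014", "Course two")
)

# Character columns combine without any factor-level coercion
combined <- do.call(rbind, rows)
str(combined)
```

This warning is separate from the timeout error and does not cause it.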

EDIT: The following code making an object of links can be used to reproduce my results:

links <- c(rep("", 100))

Answer Source

With large scraping tasks I would usually do a for-loop, which helps with troubleshooting. Create an empty list for your output:

d <- vector("list", length(links))

Here I do a for-loop, with a tryCatch block so that if the output is an error, we wait a couple of seconds and try again. We also include a counter that moves on to the next link if we're still getting an error after five attempts. In addition, we have if (!(links[i] %in% names(d))) so that if we have to break the loop, we can skip the links we've already scraped when we restart the loop.

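for (i in seq_along(links)) {
  # Skip links that were already scraped in an earlier run of the loop
  if (!(links[i] %in% names(d))) {
    cat(paste("Doing", links[i], "..."))
    ok <- FALSE
    counter <- 0
    # Try each link at most five times
    while (!ok && counter < 5) {
      counter <- counter + 1
      out <- tryCatch({
                  scrape_test(links[i])
                },
                error = function(e) {
                  Sys.sleep(2)  # wait a couple of seconds before retrying
                  e
                })
      if ("error" %in% class(out)) {
        cat(".")
      } else {
        ok <- TRUE
        cat(" Done.")
      }
    }
    cat("\n")
    d[[i]] <- out
    names(d)[i] <- links[i]
  }
}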
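
The retry logic described above can also be pulled out into a small helper, which makes it easy to test without touching the network. This is only a sketch under stated assumptions: with_retry() and flaky() are hypothetical names that do not appear in the original answer, and flaky() is an offline stand-in for scrape_test() that fails twice before succeeding.

```r
# Hypothetical helper: call f(x), retrying up to max_tries times on error.
with_retry <- function(f, x, max_tries = 5, wait = 2) {
  out <- NULL
  for (attempt in seq_len(max_tries)) {
    # Capture the error object instead of letting it abort the loop
    out <- tryCatch(f(x), error = function(e) e)
    if (!inherits(out, "error")) {
      return(out)
    }
    # Pause before the next attempt, but not after the last one
    if (attempt < max_tries) Sys.sleep(wait)
  }
  out  # still an error object after max_tries attempts
}

# Offline stand-in for scrape_test(): fails twice, then succeeds
calls <- 0
flaky <- function(link) {
  calls <<- calls + 1
  if (calls < 3) stop("Timeout was reached")
  data.frame(link = link, stringsAsFactors = FALSE)
}

res <- with_retry(flaky, "page-1", wait = 0)

# A function that always fails yields the error object, not a crash
bad <- with_retry(function(x) stop("boom"), "page-2", max_tries = 2, wait = 0)
```

With a helper like this, the scraping loop could be reduced to something like d <- lapply(links, function(l) with_retry(scrape_test, l)), and Filter(is.data.frame, d) would then keep only the links that were scraped successfully.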