DenJJ - 1 year ago
R Question

Improving my R code - advice wanted?

I have working code: a web-scraping script that first collects URLs from a webpage and then uses a for loop to run through all of them. During the loop it extracts some information and appends it to a data frame that I create empty before the loop. The process uses rbind and works fine.

However, I feel this code is not optimal and that there may be a package that does this better. I think the solution might be lapply, but maybe not. I was hoping someone could point me to a better way of coding this (if one exists) and show how it could be implemented.


library(rvest) # read_html(), html_nodes(), html_text(), html_attr() and %>%

URL <- ""

WS <- read_html(URL)

# Collect the club page links from the overview page
URLs <- WS %>% html_nodes(".hide-for-pad .vereinprofil_tooltip") %>% html_attr("href") %>% as.character()
URLs <- paste0("",URLs)

# Empty data frame that the loop appends to
Catcher1 <- data.frame(Player=character(),P_URL=character())

for (i in URLs) {
  WS1 <- read_html(i)
  Player <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_text() %>% as.character()
  P_URL <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_attr("href") %>% as.character()
  temp <- data.frame(Player,P_URL)
  Catcher1 <- rbind(Catcher1,temp)
}

Answer

You could try using purrr instead of the loop as follows:


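library(rvest)  # read_html(), html_nodes(), html_text(), html_attr()
library(purrr)  # map(), map_df()
library(tibble) # tibble()

URLs %>% 
  map(read_html) %>% 
  map(html_nodes, "#yw1 .spielprofil_tooltip") %>% 
  map_df(~tibble(Player = html_text(.), P_URL = html_attr(., "href")))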


   user  system elapsed 
  2.939   2.746   5.699 
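Since the question mentions lapply, here is a minimal base R sketch of the same idea: scrape each page into its own data frame and bind them once at the end, rather than growing a data frame with rbind inside the loop. The names `scrape_club` and `collect_players` are hypothetical helpers made up for this illustration, not part of rvest, and the selector is simply the one from the question.

```r
# Hypothetical helper (not from the question): scrape one club page
# into a two-column data frame. Needs the rvest package when called.
scrape_club <- function(url) {
  nodes <- rvest::html_nodes(rvest::read_html(url), "#yw1 .spielprofil_tooltip")
  data.frame(
    Player = rvest::html_text(nodes),
    P_URL  = rvest::html_attr(nodes, "href"),
    stringsAsFactors = FALSE
  )
}

# Apply the scraper to every URL, then bind the pieces once at the end
# instead of growing a data frame inside a loop.
collect_players <- function(urls, scrape = scrape_club) {
  pieces <- lapply(urls, scrape)
  do.call(rbind, pieces)
}
```

With the URLs vector from the question this would be called as `Catcher1 <- collect_players(URLs)`. The `scrape` argument only exists so that a stub function can be swapped in when trying the code without network access.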

The step that takes the most time is the crawling via map(read_html).
To parallelize that you can use, e.g., the parallel backend of plyr as follows:

doMC::registerDoMC(cores=3) # cores depending on your system
plyr::llply(URLs, httr::GET, .parallel = TRUE) %>% # download in parallel
  map(read_html) %>% # parse the responses sequentially
  map(html_nodes, "#yw1 .spielprofil_tooltip") %>% 
  map_df(~tibble(Player = html_text(.), P_URL = html_attr(., "href")))

Somehow my RStudio crashed when using plyr::llply(URLs, read_html, .parallel = TRUE); that's why I use the underlying httr::GET and parse the result in the next step via map(read_html). So the scraping is done in parallel, but the parsing of the responses is done sequentially.


   user  system elapsed 
  2.505   0.337   2.940 
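The same split (download in parallel, parse sequentially) can also be sketched with the parallel package that ships with R, in case plyr and doMC are not available. This is only an illustration under assumptions: `fetch_then_parse` and `parse_players` are hypothetical names, and `mclapply` is swapped in for the plyr backend used above. It has not been timed against the version above.

```r
library(parallel)  # ships with base R

# Hypothetical wrapper: download in parallel, parse one by one.
fetch_then_parse <- function(urls, fetch, parse, cores = 2L) {
  # mclapply() relies on forking, which Windows lacks,
  # so fall back to a single core there.
  if (.Platform$OS.type == "windows") cores <- 1L
  responses <- mclapply(urls, fetch, mc.cores = cores)  # parallel step
  pieces <- lapply(responses, parse)                    # sequential step
  do.call(rbind, pieces)
}

# Hypothetical parser for one httr response; needs rvest when called.
parse_players <- function(response) {
  nodes <- rvest::html_nodes(rvest::read_html(response), "#yw1 .spielprofil_tooltip")
  data.frame(
    Player = rvest::html_text(nodes),
    P_URL  = rvest::html_attr(nodes, "href"),
    stringsAsFactors = FALSE
  )
}
```

A call would then look like `fetch_then_parse(URLs, fetch = httr::GET, parse = parse_players, cores = 3L)`. Passing `fetch` and `parse` as arguments keeps the parallel part free of parsed documents, which mirrors the workaround described above.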

In both cases the result looks as follows:

# A tibble: 1,036 × 2
          Player                                P_URL
           <chr>                                <chr>
1   David de Gea   /david-de-gea/profil/spieler/59377
2      D. de Gea   /david-de-gea/profil/spieler/59377
3  Sergio Romero  /sergio-romero/profil/spieler/30690
4      S. Romero  /sergio-romero/profil/spieler/30690
5  Sam Johnstone /sam-johnstone/profil/spieler/110864
6   S. Johnstone /sam-johnstone/profil/spieler/110864
7    Daley Blind    /daley-blind/profil/spieler/12282
8       D. Blind    /daley-blind/profil/spieler/12282
9    Eric Bailly   /eric-bailly/profil/spieler/286384
10     E. Bailly   /eric-bailly/profil/spieler/286384
# ... with 1,026 more rows