Question

How do I scrape all pages (1, 2, 3, ..., n) from a website using rvest?

I would like to read the list of .html files to extract data. I'd appreciate your help.

library(rvest)
library(XML)
library(stringr)
library(data.table)
library(RCurl)
u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- ("C:/R/BNB/")
pages <- html_text(html_node(u1,".results_count"))
Total_Pages <- substr(pages,4,7)
TP <- as.numeric(Total_Pages)
# reading first two pages, writing them as separate .html files
for (i in 1:TP) {
  url <- paste(u0,"page=/",i, sep="")
  download.file(url,paste(download_folder,i,".html",sep=""))
  #create html object
  html <- html(paste(download_folder,i,".html",sep=""))
}

Answer

Here is a potential solution:

library(rvest)
library(stringr)

u0 <- "https://www.r-users.com/jobs/"
u1 <- read_html("https://www.r-users.com/jobs/")
download_folder <- getwd()  # note the change in output directory

TP <- max(as.integer(html_text(html_nodes(u1, "a.page-numbers"))), na.rm = TRUE)

# download each results page and save it as a separate .html file
for (i in 1:TP) {
  url <- paste(u0, "page/", i, "/", sep = "")
  print(url)
  destfile <- file.path(download_folder, paste(i, ".html", sep = ""))
  download.file(url, destfile)
  # create an html object from the downloaded file
  html <- read_html(destfile)
}

I could not find the class .results_count in the HTML, so instead I looked for the page-numbers class and picked the highest value returned. Also, the function html() is deprecated, so I replaced it with read_html(). Good luck!
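
To then extract data from the saved files (the original goal in the question), you can read them back in and apply the usual rvest selectors. Below is a minimal sketch, assuming the files were written by the loop above; the "h3" selector is an assumption about the page markup and should be checked against the actual HTML in your browser's inspector.

library(rvest)

# list the .html files written by the download loop above
saved_files <- list.files(download_folder, pattern = "\\.html$", full.names = TRUE)

# read each file and pull out the job titles (selector is an assumption)
titles <- unlist(lapply(saved_files, function(f) {
  page <- read_html(f)
  html_text(html_nodes(page, "h3"))  # adjust the selector to the real markup
}))

head(titles)

From there you can combine the extracted fields into a data.frame for further analysis.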