
Scraping data from a site with multiple URLs

I've been trying to scrape a list of companies off of the site's Fortune 500 archive (the .../401.html list page). I can scrape the single table off of this page with this code:

library(rvest)

# the full page URL is elided in the original post; the path follows the
# pattern .../fortune500_archive/full/<year>/<start>.html described below
fileurl <- read_html(".../fortune500_archive/full/2005/401.html")
content <- fileurl %>%
  html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
  html_table()
contentframe <- data.frame(content)
View(contentframe)

However, I need all of the data going back from 2005 to 1955, as well as the full list of companies 1 through 500, whereas this page only shows 100 companies for a single year at a time. I've recognized that the only parts of the URL that change are "...fortune500_archive/full/" YEAR "/" and the starting company number (1, 101, 201, 301, or 401, one per range of 100 companies shown).

I also understand that I have to create a loop that will automatically collect this data for me as opposed to me manually replacing the url after saving each table. I've tried a few variations of sapply functions from reading other posts and watching videos, but none will work for me and I'm lost.


A few suggestions to get you started. First, it may be useful to write a function to download and parse each page, e.g.

getData <- function(year, start) {
  # the URL template is elided in the original answer; the path pattern below
  # follows the .../fortune500_archive/full/<year>/<start>.html scheme from the question
  url <- sprintf(".../fortune500_archive/full/%d/%d.html", year, start)
  fileurl <- read_html(url)
  content <- fileurl %>%
    html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
    html_table()
  data.frame(content)
}
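
Assuming the elided URL template is filled in with the real domain, it is worth sanity-checking the helper on a single page before looping, e.g.:

head(getData(2005, 401))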

We can then loop through the years and pages using lapply (together with do.call(rbind, ...) to rbind all 5 data frames from each year together). E.g.:

D <- do.call(rbind, lapply(2000:2005, function(year) {
  do.call(rbind, lapply(seq(1, 500, 100), function(start) {
    cat(paste("Retrieving", year, ":", start, "\n"))
    getData(year, start)
  }))
}))
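
One optional refinement, not part of the original suggestion: the scraped tables may not record which year they came from, so you could tag each chunk before binding and pause briefly between requests. A rough sketch, assuming getData as defined above (getYear is an illustrative helper name):

getYear <- function(year) {
  pages <- lapply(seq(1, 500, 100), function(start) {
    Sys.sleep(1)                  # small pause between requests to be polite
    df <- getData(year, start)
    df$Year <- year               # record the year alongside each row
    df
  })
  do.call(rbind, pages)           # five pages of 100 -> one data frame per year
}

D <- do.call(rbind, lapply(2000:2005, getYear))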