Anthony Potts - 3 months ago

Question

Scraping data from a site with multiple URLs

I've been trying to scrape a list of companies from the Fortune 500 archive on archive.fortune.com (pages such as .../fortune500_archive/full/2005/401.html). I can scrape the single table from one of these pages with this code:

library(rvest)

fileurl <- read_html("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/2005/1")
content <- fileurl %>%
  html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
  html_table()
contentframe <- data.frame(content)
View(contentframe)


However, I need all of the data going back from 2005 to 1955, as well as the full list of companies 1 through 500, whereas this page only shows 100 companies for a single year at a time. I've noticed that the only parts of the URL that change are "...fortune500_archive/full/" YEAR "/" followed by 1, 101, 201, 301, or 401 (one page per range of 100 companies).

I also understand that I need a loop to collect this data automatically instead of manually editing the URL and saving each table. I've tried a few sapply variations based on other posts and videos, but none of them have worked and I'm lost.

Answer

A few suggestions to get you started. First, it may be useful to write a function to download and parse each page, e.g.

getData <- function(year, start) {
  # Build the URL for a given year and starting rank (1, 101, ..., 401)
  url <- sprintf("http://archive.fortune.com/magazines/fortune/fortune500_archive/full/%d/%d.html", 
    year, start)
  fileurl <- read_html(url)
  # Extract the companies table and parse it into a data frame
  content <- fileurl %>%
    html_nodes(xpath = '//*[@id="MagListDataTable"]/table[2]') %>%
    html_table()
  data.frame(content)
}
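
A quick sanity check on a single page before looping (hypothetical; the exact column names depend on what html_table() extracts from this site):

df <- getData(2005, 1)
dim(df)   # expect 100 rows, one per company on the page
head(df)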

We can then loop over the years and pages with lapply, using do.call(rbind, ...) to bind the five 100-company data frames from each year into a single data frame. E.g.:

D <- lapply(2000:2005, function(year) {
  do.call(rbind, lapply(seq(1, 500, 100), function(start) {
    cat(paste("Retrieving", year, ":", start, "\n"))  # progress message
    getData(year, start)
  }))
})
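
Note that D is a list with one data frame per year. If you want everything in a single table, you can tag each year's rows and bind them together (a sketch; the Year column name is just a suggestion):

years <- 2000:2005
allData <- do.call(rbind, Map(function(df, yr) cbind(Year = yr, df), D, years))

If some archive pages are missing or a request fails, wrapping the getData() call in tryCatch() will let the loop skip failures instead of aborting.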