
Scraping .asp site with R

I'm scraping pages of the form http://www.progarchives.com/album.asp?id= (with an album id appended) and I get a warning message:


Warning message:
XML content does not seem to be XML:
http://www.progarchives.com/album.asp?id=2
http://www.progarchives.com/album.asp?id=3
http://www.progarchives.com/album.asp?id=4
http://www.progarchives.com/album.asp?id=5


The scraper works for each page separately, but not when I pass it the full range of URLs (b1=2, b2=1000):

library(RCurl)
library(XML)

getUrls <- function(b1, b2) {
  root <- "http://www.progarchives.com/album.asp?id="
  urls <- NULL
  for (bandid in b1:b2) {
    urls <- c(urls, paste(root, bandid, sep=""))
  }
  return(urls)
}

prog.arch.scraper <- function(url) {
  SOURCE <- getUrls(b1=2, b2=1000)
  PARSED <- htmlParse(SOURCE)
  album <- xpathSApply(PARSED, "//h1[1]", xmlValue)
  date <- xpathSApply(PARSED, "//strong[1]", xmlValue)
  band <- xpathSApply(PARSED, "//h2[1]", xmlValue)
  return(c(band, album, date))
}

prog.arch.scraper(urls)
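
By "works for each page separately" I mean parsing one URL at a time, roughly like this sketch (the scrape_one helper is just an illustrative name, using the same XPath expressions as above):

library(RCurl)
library(XML)

# illustrative per-page helper: fetch and parse one document per call
scrape_one <- function(url) {
  page <- getURL(url)
  parsed <- htmlParse(page, asText=TRUE)
  album <- xpathSApply(parsed, "//h1[1]", xmlValue)
  date <- xpathSApply(parsed, "//strong[1]", xmlValue)
  band <- xpathSApply(parsed, "//h2[1]", xmlValue)
  c(band, album, date)
}

# one result per page, e.g. for a handful of ids
results <- lapply(getUrls(b1=2, b2=5), scrape_one)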

Answer

Here's an alternate approach with rvest and dplyr:

library(rvest)
library(dplyr)
library(pbapply)

base_url <- "http://www.progarchives.com/album.asp?id=%s"

get_album_info <- function(id) {

  pg <- html(sprintf(base_url, id))  # html() was later renamed read_html() in rvest
  data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
             date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
             band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
             stringsAsFactors=FALSE)

}

albums <- bind_rows(pblapply(2:10, get_album_info))

head(albums)

## Source: local data frame [6 x 3]
## 
##                        album                           date      band
## 1                    FOXTROT Studio Album, released in 1972   Genesis
## 2              NURSERY CRYME Studio Album, released in 1971   Genesis
## 3               GENESIS LIVE         Live, released in 1973   Genesis
## 4        A TRICK OF THE TAIL Studio Album, released in 1976   Genesis
## 5 FROM GENESIS TO REVELATION Studio Album, released in 1969   Genesis
## 6           GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz

I didn't feel like barraging the site with a ton of requests, so bump up the sequence (e.g. 2:1000) for your own use. pblapply gives you a free progress bar.

To be kind to the site (especially since it doesn't explicitly prohibit scraping), you might want to throw a Sys.sleep(10) at the end of the get_album_info function.
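
That's only a small change; a sketch of the same function with the pause added before returning:

get_album_info <- function(id) {

  pg <- html(sprintf(base_url, id))
  info <- data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
                     date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
                     band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
                     stringsAsFactors=FALSE)
  Sys.sleep(10)  # pause between requests to go easy on the server
  info

}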

UPDATE

To handle server errors (in this case 500, but it'll work for others, too), you can use try:

library(rvest)
library(dplyr)
library(pbapply)
library(data.table)

base_url <- "http://www.progarchives.com/album.asp?id=%s"

get_album_info <- function(id) {

  pg <- try(html(sprintf(base_url, id)), silent=TRUE)

  if (inherits(pg, "try-error")) {
    data.frame(album=character(0), date=character(0), band=character(0))
  } else {
    data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
               date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
               band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
               stringsAsFactors=FALSE)
  }

}

albums <- rbindlist(pblapply(c(9:10, 23, 28, 29, 30), get_album_info))

##                       album                           date         band
## 1: THE DANGERS OF STRANGERS Studio Album, released in 1988    Abel Ganz
## 2:    THE DEAFENING SILENCE Studio Album, released in 1994    Abel Ganz
## 3:             AD INFINITUM Studio Album, released in 1998 Ad Infinitum

You won't get any entries for the errant pages (in this case it just returns the entries for ids 9, 10 and 30).
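
If you'd rather see at a glance which ids failed, one option is to keep an id column and return a placeholder row on error (a sketch; the id column is an addition, not part of the function above):

get_album_info <- function(id) {

  pg <- try(html(sprintf(base_url, id)), silent=TRUE)

  if (inherits(pg, "try-error")) {
    # placeholder row so failed ids stay visible in the combined result
    data.frame(id=id, album=NA_character_, date=NA_character_,
               band=NA_character_, stringsAsFactors=FALSE)
  } else {
    data.frame(id=id,
               album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
               date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
               band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
               stringsAsFactors=FALSE)
  }

}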