torentino torentino - 1 year ago 124
HTML Question

Scraping .asp site with R

I'm scraping http://www.progarchives.com/album.asp?id= and get this warning message:


Warning message:

XML content does not seem to be XML:

http://www.progarchives.com/album.asp?id=2

http://www.progarchives.com/album.asp?id=3

http://www.progarchives.com/album.asp?id=4

http://www.progarchives.com/album.asp?id=5


The scraper works for each page separately, but not for the whole range of URLs (b1=2 to b2=1000).

library(RCurl)
library(XML)

getUrls <- function(b1,b2){
  root="http://www.progarchives.com/album.asp?id="
  urls <- NULL
  for (bandid in b1:b2){
    urls <- c(urls,(paste(root,bandid,sep="")))
  }
  return(urls)
}

prog.arch.scraper <- function(url){
  SOURCE <- getUrls(b1=2,b2=1000)
  PARSED <- htmlParse(SOURCE)
  album <- xpathSApply(PARSED,"//h1[1]",xmlValue)
  date <- xpathSApply(PARSED,"//strong[1]",xmlValue)
  band <- xpathSApply(PARSED,"//h2[1]",xmlValue)
  return(c(band,album,date))
}

prog.arch.scraper(urls)

Answer Source

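htmlParse handles one document per call; given the whole vector of URLs it does not fetch each page, which is what triggers the warning. Parse one page at a time instead. Here's an alternate approach with rvest and dplyr: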

library(rvest)
library(dplyr)
library(pbapply)

base_url <- "http://www.progarchives.com/album.asp?id=%s"

get_album_info <- function(id) {

  # read_html() replaces html(), which was deprecated and later removed from rvest
  pg <- read_html(sprintf(base_url, id))
  data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
             date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
             band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
             stringsAsFactors=FALSE)

}

albums <- bind_rows(pblapply(2:10, get_album_info))

head(albums)

## Source: local data frame [6 x 3]
## 
##                        album                           date      band
## 1                    FOXTROT Studio Album, released in 1972   Genesis
## 2              NURSERY CRYME Studio Album, released in 1971   Genesis
## 3               GENESIS LIVE         Live, released in 1973   Genesis
## 4        A TRICK OF THE TAIL Studio Album, released in 1976   Genesis
## 5 FROM GENESIS TO REVELATION Studio Album, released in 1969   Genesis
## 6           GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz

I didn't feel like barraging the site with a ton of requests, so widen the sequence (2:10 here) for your own use. pblapply gives you a free progress bar.

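To be kind to the site (esp. since it doesn't explicitly prohibit scraping) you might want to throw a Sys.sleep(10) into the get_album_info function. Put it before the final data.frame(...) expression rather than at the very end, since a function returns its last expression and Sys.sleep returns NULL.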
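If you'd rather not touch get_album_info itself, the pause can live in a small wrapper. This is only a sketch in base R: with_delay, double_it and polite_double are made-up names for illustration, and the delay is set to 0 in the demo so that it runs instantly.

```r
# with_delay() is a hypothetical helper (not part of rvest or pbapply).
# It returns a new function that calls f, then sleeps, then returns f's result.
with_delay <- function(f, seconds = 10) {
  force(f)        # evaluate the arguments now, so that later reassignment
  force(seconds)  # of the original name cannot change what gets called
  function(...) {
    result <- f(...)    # do the real work first
    Sys.sleep(seconds)  # then pause before the caller makes the next request
    result              # return the result, not Sys.sleep()'s NULL
  }
}

# Stand-in for a scraping function, so the example needs no network access
double_it <- function(x) x * 2
polite_double <- with_delay(double_it, seconds = 0)
polite_double(21)
```

With the real scraper you would do something like polite_get <- with_delay(get_album_info) and pass polite_get to pblapply.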

UPDATE

To handle server errors (in this case 500, but it'll work for others, too), you can use try:

library(rvest)
library(dplyr)
library(pbapply)
library(data.table)

base_url <- "http://www.progarchives.com/album.asp?id=%s"

get_album_info <- function(id) {

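  # read_html() replaces html(), which was deprecated and later removed from rvest
  pg <- try(read_html(sprintf(base_url, id)), silent=TRUE)

  if (inherits(pg, "try-error")) {
    data.frame(album=character(0), date=character(0), band=character(0),
               stringsAsFactors=FALSE)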
  } else {
    data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
               date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
               band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
               stringsAsFactors=FALSE)
  }

}

albums <- rbindlist(pblapply(c(9:10, 23, 28, 29, 30), get_album_info))

##                       album                           date         band
## 1: THE DANGERS OF STRANGERS Studio Album, released in 1988    Abel Ganz
## 2:    THE DEAFENING SILENCE Studio Album, released in 1994    Abel Ganz
## 3:             AD INFINITUM Studio Album, released in 1998 Ad Infinitum

You won't get any entries for the errant pages (in this case it just returns the entries for ids 9, 10 and 30).
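If you also want to see which ids failed, one option is to return a placeholder row of NAs instead of an empty data frame. A rough sketch in base R, not the approach used above: safe_row and fake_fetch are hypothetical names, and fake_fetch just simulates a server error for id 23 so the example runs without hitting the site.

```r
# safe_row() is a hypothetical helper. It calls fetch(id) and prepends the id;
# on any error it returns a one-row data frame of NAs for that id instead.
safe_row <- function(id, fetch) {
  tryCatch(
    data.frame(id = id, fetch(id), stringsAsFactors = FALSE),
    error = function(e) {
      data.frame(id = id, album = NA_character_, date = NA_character_,
                 band = NA_character_, stringsAsFactors = FALSE)
    }
  )
}

# Stand-in for get_album_info: fails for id 23, succeeds otherwise
fake_fetch <- function(id) {
  if (id == 23) stop("HTTP 500")  # simulated server error
  data.frame(album = paste("ALBUM", id), date = "Studio Album", band = "Band",
             stringsAsFactors = FALSE)
}

rows <- lapply(c(9, 23), safe_row, fetch = fake_fetch)
albums <- do.call(rbind, rows)

# ids whose page could not be fetched
failed_ids <- albums$id[is.na(albums$album)]
```

Keeping the id column makes it easy to retry only the failed pages later.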
