jasperchen - 10 months ago
R Question

Use xpathSApply in R

I would like to extract the href values from the HTML below.


For each topic, I would like to get something like this:





I am using the code below, but it doesn't work.


data <- list()

for (i in 101:201) {
  url <- paste('bbsdoc1/USANews_', i, '_0.html', sep='')
  html <- content(GET("http://www.mitbbs.com/", path = url), as = 'parsed')
  url.list <- xpathSApply(html, "//td[@align='left' height=26]/[@class='news1' href]", xmlAttrs)
  data <- rbind(data, url.list)
}


Your suggestions are really appreciated!

Answer

Retrieve the document

library(XML)
html = htmlParse("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html")

and extract the links and text you're interested in using the appropriate XPath queries

href = "//a[./@class='news1']/@href"
text = "//a[./@class='news1']/text()"
df = data.frame(
    url=sub("article_t/", "", sapply(html[href], as.character)),
    text=trimws(sapply(html[text], xmlValue)))

trimws() is available in R 3.2.0 and later.
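
As for why the query in the question fails: XPath conditions inside one predicate have to be joined with `and` (e.g. `[@align='left' and @height='26']`), and every step needs a node test, so `/[@class='news1' href]` is not a valid step. The sketch below is one possible variation on the approach above, not the only way to do it. It wraps the extraction in a helper, `extract_links()`, which is a hypothetical name introduced here, and tries it on a small made-up HTML snippet so it can be run without network access. It selects the `<a>` nodes once and reads both the attribute and the text from them, and it leaves the href values untouched rather than stripping "article_t/".

```r
library(XML)

# Hypothetical helper (not part of the original answer): return one row per
# <a class="news1" href="..."> link found in an already parsed document.
extract_links <- function(doc) {
  # Require @href in the query so xmlGetAttr() never returns NULL
  a_nodes <- getNodeSet(doc, "//a[@class='news1'][@href]")
  data.frame(
    url  = vapply(a_nodes, xmlGetAttr, character(1), "href"),
    text = trimws(vapply(a_nodes, xmlValue, character(1))),
    stringsAsFactors = FALSE
  )
}

# Made-up markup standing in for one page of the board listing
snippet <- '<html><body><table><tr>
  <td align="left" height="26"><a class="news1" href="/article_t/USANews/1.html"> First topic </a></td>
  <td align="left" height="26"><a class="news1" href="/article_t/USANews/2.html"> Second topic </a></td>
  <td><a class="other" href="/ignored.html">Ignored</a></td>
</tr></table></body></html>'

demo <- extract_links(htmlParse(snippet, asText = TRUE))
print(demo)

# Looping over the pages from the question (needs network access, not run here):
# pages  <- sprintf("http://www.mitbbs.com/bbsdoc1/USANews_%d_0.html", 101:201)
# result <- do.call(rbind, lapply(pages, function(u) extract_links(htmlParse(u))))
```

Because the helper returns a data frame with the same two columns even when a page has no matching links, the per-page results can be combined with a single rbind() at the end instead of growing `data` inside the loop.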