jasperchen - 2 months ago
R Question

Use xpathSApply in R

I would like to extract the href of each topic link from the page below.

http://www.mitbbs.com/bbsdoc1/USANews_101_0.html

For each topic, I would like to get something like this:

/USANews/31587637.html

/USANews/31587633.html

/USANews/31587631.html

...

The code I am using is below, but it doesn't work.

library("XML")
library("httr")
library("stringr")

data <- list()

for (i in 101:201) {
    url <- paste('bbsdoc1/USANews_', i, '_0.html', sep='')
    html <- content(GET("http://www.mitbbs.com/", path = url), as = 'parsed')
    url.list <- xpathSApply(html, "//td[@align='left' height=26]/[@class='news1' href]", xmlAttrs)
    data <- rbind(data, url.list)
}


Your suggestions are really appreciated!

Answer

Retrieve the document

library(XML)
html = htmlParse("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html")

and extract the links and text you're interested in using the appropriate xpath query

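# XPath queries: the href attribute and the text of every <a class="news1">
href = "//a[./@class='news1']/@href"
text = "//a[./@class='news1']/text()"
# html[query] runs the query on the parsed document;
# sub() strips the "article_t/" segment from each link
df = data.frame(
    url=sub("article_t/", "", sapply(html[href], as.character)),
    text=trimws(sapply(html[text], xmlValue)))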

trimws() is available in R 3.2.0 and later.
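
To cover pages 101 to 201 as in the question, the same queries could be wrapped in a function and applied to each page. The following is only a sketch under stated assumptions, not tested against the live site. page_url and extract_links are hypothetical helper names. The inline HTML snippet is made up to resemble the structure the queries assume: links with class "news1" whose href contains "article_t/". It uses xpathSApply, as in the question.

```r
library(XML)

# Hypothetical helper: build the URL of index page i, following the
# pattern used in the question.
page_url <- function(i) {
    sprintf("http://www.mitbbs.com/bbsdoc1/USANews_%d_0.html", i)
}

# Hypothetical helper: run the two XPath queries on a parsed document
# and return one row per matching link.
extract_links <- function(doc) {
    href <- xpathSApply(doc, "//a[@class='news1']/@href")
    text <- xpathSApply(doc, "//a[@class='news1']", xmlValue)
    data.frame(
        url = sub("article_t/", "", unname(as.character(href))),
        text = trimws(as.character(text)),
        stringsAsFactors = FALSE)
}

# Made-up snippet standing in for one index page (an assumption about
# the page layout, used here so the example runs offline).
snippet <- '<html><body><table>
  <tr><td align="left" height="26">
    <a class="news1" href="/article_t/USANews/31587637.html"> First topic </a>
  </td></tr>
  <tr><td align="left" height="26">
    <a class="news1" href="/article_t/USANews/31587633.html"> Second topic </a>
  </td></tr>
</table></body></html>'

doc <- htmlParse(snippet, asText = TRUE)
links <- extract_links(doc)

# XPath predicates combine attribute tests with "and", and every step
# needs a node test such as "a". This narrower query assumes the links
# sit in <td align="left" height="26"> cells, as in the snippet.
narrow <- xpathSApply(
    doc, "//td[@align='left' and @height='26']/a[@class='news1']/@href")

# Live use (requires network access; not run here):
# pages <- lapply(101:201, function(i) extract_links(htmlParse(page_url(i))))
# data <- do.call(rbind, pages)
```

Collecting the per-page data frames in a list and binding them once with do.call(rbind, ...) avoids growing the result inside the loop.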
