user2840286 - 3 months ago

Parsing HTML file in R

I want to read HTML files from a web site. Specifically, I want to read books in HTML format from The title of each chapter is marked with the tag "h2" and the content of each chapter follows in the paragraph tags "p" after the "h2". Using the package XML I am able to get the values or the full HTML code for each tag.

Here is some sample code using George Eliot's Middlemarch:


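library(XML)

# Parse the page into an internal document so it can be queried with XPath
doc.html <- htmlTreeParse('',
                          useInternalNodes = TRUE)
# Text content of every h2 and p element
doc.value <- xpathApply(doc.html, '//h2|//p', xmlValue)
# The matching nodes themselves, with all their markup
doc.html.value <- xpathApply(doc.html, '//h2|//p')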

doc.value contains a list where each element is the content of a tag, but I cannot tell whether it came from an h2 tag or a p tag. On the other hand, doc.html.value contains a list with the HTML code for each tag. This tells me whether it is an "h2" or a "p" tag, but it also contains a lot of extra code (style information, etc.) that I don't need.

My question: Is there a simple way to obtain only the type of the tag and the value of the tag without the other information associated with it?


The documentation for xmlValue points to another function, xmlName, which extracts just the name of the tag. Combining the two in a single function passed to xpathApply gives what you want:

xpathApply(doc.html, '//h2|//p', function(x) list(name = xmlName(x), content = xmlValue(x)))

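Each element of the result is a list holding the tag name and its text. The first element, for example, prints as:

$name
[1] "h2"

$content
[1] "\r\nGeorge Eliot\r\n"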
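
If a flat table is easier to work with than a list of lists, the same xmlName/xmlValue idea can be sketched as below. This is only an illustration under some assumptions: the inline HTML string is a made-up stand-in for the book page (the real URL is not shown above), and the names html, nodes, tags and the chapter column are hypothetical. It uses htmlParse and getNodeSet from the XML package in place of htmlTreeParse and xpathApply.

```r
library(XML)

# Hypothetical stand-in for the book page; the real URL is not shown above
html <- "<html><body>
<h2 style='color:red'>Chapter 1</h2>
<p class='x'>First paragraph.</p>
<p>Second paragraph.</p>
<h2>Chapter 2</h2>
<p>Third paragraph.</p>
</body></html>"

# Parse the string itself (asText = TRUE) instead of fetching a file or URL
doc <- htmlParse(html, asText = TRUE)

# Select the h2 and p nodes once, in document order
nodes <- getNodeSet(doc, "//h2|//p")

# One row per node: the tag name and its text, without attributes or markup
tags <- data.frame(
  tag  = sapply(nodes, xmlName),
  text = trimws(sapply(nodes, xmlValue)),
  stringsAsFactors = FALSE
)

# Number the chapters: every h2 starts a new group, so the paragraphs
# that follow it share its chapter number
tags$chapter <- cumsum(tags$tag == "h2")

print(tags)
```

With the chapter column in place, something like split(tags$text, tags$chapter) would give one character vector per chapter, with the title first and the paragraphs after it. Whether that grouping is right for the actual book depends on the page really using h2 only for chapter titles, as described in the question.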