Piyush Saxena - 4 years ago
R Question

Getting first line search results from google

I am using the XML and RCurl packages in R to fetch the first page of Google search results:

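library(RCurl)  # getForm
library(XML)    # htmlParse, xpathSApply
site <- getForm("http://www.google.com/search", hl = "en", lr = "", q = "life of pi", btnG = "Search")  # q is the query
doc <- htmlParse(site, asText = TRUE)
# All text nodes that are not inside script, style, noscript or form elements
plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)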


What XPath expression should I pass to xpathSApply so that I get only the title line of each search result (the ones shown in blue in a larger font)?

Answer Source

Rather than excluding elements with not(ancestor::...), try selecting the heading tags directly. Here the result titles come back from the h3 elements:

xpathSApply(doc, "//h3", xmlValue)
 [1] "LIFE OF PI - Buy it on Digital HD, Blu-ray & DVD"
 [2] "Life of Pi - Wikipedia, the free encyclopedia"
 [3] "Life of Pi (film) - Wikipedia, the free encyclopedia"
 [4] "Images for life of pi" 
 [5] "Life of Pi (2012) - IMDb" 
 ...
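
If you also want the link behind each title, or only the very first result, the same idea extends naturally. The following is a minimal sketch run against a small inline HTML string rather than a live Google page; the sample markup, the example.com URLs and the variable names are assumptions made for illustration, and Google's real markup differs and changes over time.

```r
library(XML)

# Hypothetical stand-in for a results page (not real Google markup)
sample_html <- '<html><body>
  <div><h3><a href="http://example.com/a">Life of Pi - Wikipedia</a></h3><span>snippet text</span></div>
  <div><h3><a href="http://example.com/b">Life of Pi (2012) - IMDb</a></h3><span>another snippet</span></div>
</body></html>'

doc <- htmlParse(sample_html, asText = TRUE)

# Titles: the text content of every <h3> element
titles <- xpathSApply(doc, "//h3", xmlValue)

# Link targets: the href attribute of each <a> directly inside an <h3>
links <- xpathSApply(doc, "//h3/a", xmlGetAttr, "href")

# Only the first <h3> in document order
first_title <- xpathSApply(doc, "(//h3)[1]", xmlValue)

print(data.frame(title = titles, link = links, stringsAsFactors = FALSE))
```

The parentheses in "(//h3)[1]" matter: "//h3[1]" would select every h3 that is the first h3 child of its parent, which in this sample matches both results.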