manzoor Razaak - 1 year ago
R Question

Google search links obtained by web scraping in R are not in the required format

I am new to web scraping in R. I am trying to run a Google search for a search term from R and extract the result links automatically. I have partially succeeded in obtaining the links of the Google search results using the RCurl and XML packages. However, the href values I extract include unwanted information and are not in the format of a plain URL.

The code I use is:

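library(RCurl)  # getURL()
library(XML)    # htmlParse(), xpathApply(), xmlGetAttr()
html <- getURL(u)  # u holds the search URL
doc <- htmlParse(html)  # parse the page so XPath can be applied
links <- xpathApply(doc, "//h3//a[@href]", xmlGetAttr, 'href')
links <- grep("http://", links, fixed = TRUE, value=TRUE)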

The code above gives me seven links, but they are in the following format:

[1] "/url?q="

I would prefer them to be:

How do I extract the href as above?

Answer Source

You can use the rvest package, which builds on an XML parsing package (xml2 in current versions) and adds a lot of handy features for scraping:

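library(rvest)
ht <- read_html('')
links <- ht %>% html_nodes(xpath='//h3/a') %>% html_attr('href')
gsub('/url\\?q=', '', sapply(strsplit(links[grep('url', links)], split='&'), '[', 1))  # split on '&', keep the first piece, drop '/url?q='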


[1] ""                                                                   
[2] ""                                    
[3] ""                                                                      
[4] ""                                                                  
[5] ""
[6] ""                                                                        
[7] ""                                                      
[8] ""                                                                  
[9] ""   

The fourth line of the code cleans the text. It first splits each resulting URL (which comes with trailing garbage) on '&', then takes the first element of each split and replaces '/url?q=' with an empty string.
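
To make the cleaning step concrete, here is a minimal base R sketch that applies it to a few made-up hrefs, so it runs without hitting Google. The example.com and example.org links and the variable names (raw_links, redirects, first_part, clean_links) are assumptions for illustration only; they are not from the original post, and real Google hrefs may carry different parameters.

```r
# Hypothetical hrefs in the "/url?q=...&..." shape; the target URLs are placeholders.
raw_links <- c(
  "/url?q=http://example.com/page-one&sa=U&ved=abc123",
  "/url?q=https://example.org/docs/&sa=U&ved=def456",
  "/search?q=related+query"
)

# Keep only the redirect-style links that start with "/url?q=".
redirects <- raw_links[startsWith(raw_links, "/url?q=")]

# Split each link on "&" and keep the first piece, dropping the trailing parameters.
first_part <- vapply(strsplit(redirects, "&", fixed = TRUE), `[`, character(1), 1)

# Remove the "/url?q=" prefix to leave the bare target URL.
clean_links <- sub("/url?q=", "", first_part, fixed = TRUE)

# Targets may be percent-encoded; URLdecode() works on one string at a time.
clean_links <- vapply(clean_links, utils::URLdecode, character(1), USE.NAMES = FALSE)

print(clean_links)
```

Using fixed = TRUE avoids having to escape the "?" in the prefix, and filtering on the prefix first means unrelated hrefs (such as the "/search?q=" entry) are dropped rather than mangled.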

Hope it helps!

