Feng Chen - 1 month ago
R Question

How to retrieve only the titles from a Google query result using rvest

I use rvest to retrieve the titles from a Google query result. My code looks like this:

library(rvest)

url <- URLencode(paste0("https://www.google.com.au/search?q=", "600d"))
page <- read_html(url)
# Extract the text of every link on the page
page %>%
  html_nodes("a") %>%
  html_text()


However, the result includes not only the titles but also the text of every other link on the page, for example:

[24] "Past month"
[25] "Past year"
[26] "Verbatim"
[27] "EOS 600D - Canon"
[28] "Similar"
[29] "Canon 600D | BIG W"
[30] "Cached"
[31] "Similar"
......
[45] ""
[46] ""


The only entries I need are [27] "EOS 600D - Canon" and [29] "Canon 600D | BIG W", which are the result titles shown on the Google results page.

Everything else is just noise for me. Could anyone please tell me how to get rid of it?

Also, what should I do if I want the description part as well?

Answer

To get just the titles, select the <h3> elements instead of the <a> (link) elements:

page %>% 
  html_nodes("h3") %>%
  html_text()

 [1] "EOS 600D - Canon"                                                   
 [2] "Canon EOS 600D - Wikipedia"                                         
 [3] "Canon 600D | BIG W"                                                 
 [4] "Canon EOS 600D Digital SLR Camera with 18-55mm IS Lens Kit ..."     
 [5] "Canon Rebel T3i / EOS 600D Review: Digital Photography Review"      
 [6] "Canon EOS 600D review - CNET"                                       
 [7] "canon eos 600d | Cameras | Gumtree Australia Free Local Classifieds"
 [8] "Images for 600d"                                                    
 [9] "Canon 600D - Snapsort"                                              
[10] "Canon EOS 600D - Georges Cameras"
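
The difference between the two selectors can be reproduced offline. The following is a minimal sketch, not Google's actual markup: the HTML string, the `example.com` URLs and the `desc` class are all made up for illustration. For the description part, the real class name would have to be found by inspecting the live page, and it may change over time.

```r
library(rvest)

# Hypothetical, simplified markup standing in for a results page.
# The structure and the "desc" class are assumptions for illustration only.
html <- '<html><body>
  <a href="#">Past month</a>
  <div class="g">
    <h3><a href="https://example.com/a">EOS 600D - Canon</a></h3>
    <a href="#">Cached</a> <a href="#">Similar</a>
    <span class="desc">Official product page.</span>
  </div>
  <div class="g">
    <h3><a href="https://example.com/b">Canon 600D | BIG W</a></h3>
    <a href="#">Similar</a>
    <span class="desc">Retail listing.</span>
  </div>
</body></html>'

page <- read_html(html)

# Selecting every <a> also picks up the "Cached" / "Similar" noise
all_links <- page %>% html_nodes("a") %>% html_text()

# Selecting <h3> returns only the result titles
titles <- page %>% html_nodes("h3") %>% html_text()

# Link targets of the titles, taken from the <a> nested inside each <h3>
urls <- page %>% html_nodes("h3 a") %>% html_attr("href")

# Descriptions, using the hypothetical "desc" class from the toy markup
descs <- page %>% html_nodes("span.desc") %>% html_text()
```

Here `titles` holds only the two result titles, while `all_links` also contains "Past month", "Cached" and "Similar". The same pattern (pick the narrowest CSS selector that matches only what you want) applies to the description, once the right selector for the real page is known.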