majesus majesus - 7 months ago 21
HTML Question

htmlParse - inner text

I need to scrape the inner text from an HTML document like the one below, using htmlParse (package: XML) in R:

<h1 class="IT">
<span class="f" id="hotel">HOTEL</span>
<span class="nowrap">
<i class="b stars ratings_stars_5 star_track" data-track-on-mouseover=""></i>
</span>
</h1>


I am using this code (code-example) to scrape the hotel names, but I also need to add the hotel ratings:

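library(RCurl)  # getURL
library(XML)    # htmlParse
library(CSS)    # cssApply, cssApplyInNodeSet, cssCharacter

for (i in seq_len(3)) {
  # .encoding sets the character encoding; plain `encoding` is passed on as a curl option
  txt <- getURL(url = baseURL[i], followlocation = TRUE, .encoding = "UTF-8")
  doc <- htmlParse(txt)

  # two ways to read the hotel name from the h3 inside .details
  hotel <- cssApply(doc, ".details>h3", cssCharacter)
  hotel <- cssApplyInNodeSet(doc, ".details", "h3", cssCharacter)
  data <- cbind(hotel)
}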

Answer

rvest can generally make these ops much easier:

library(rvest)
library(stringr)

pg <- read_html("http://www.booking.com/hotel/es/starwoodalfonso.es.html#tab-reviews")

pg %>% 
  html_nodes("i.b-sprite.stars") %>% 
  html_attr("class") %>% 
  str_extract("ratings_stars_[[:digit:]]+") %>% 
  str_replace("ratings_stars_", "") %>% 
  as.numeric()

## [1] 5

pg %>% 
  html_nodes("span#hp_hotel_name") %>% 
  html_text()

## [1] "Hotel Alfonso XIII"

It should be very straightforward to put the results in a data.frame, wrap the iteration in an lapply, and then combine the pieces with dplyr::bind_rows.
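A minimal sketch of that lapply + bind_rows pattern follows. The `pages` vector holds made-up inline HTML so the example runs offline, and `scrape_hotel` is a hypothetical helper name; in practice you would pass your own URLs (e.g. `baseURL`) instead, and the selectors are assumed to match the live page.

```r
library(rvest)
library(stringr)
library(dplyr)

# Hypothetical stand-ins for downloaded pages (not real booking.com markup).
pages <- c(
  '<h1><span id="hp_hotel_name">Hotel A</span><i class="b-sprite stars ratings_stars_5"></i></h1>',
  '<h1><span id="hp_hotel_name">Hotel B</span><i class="b-sprite stars ratings_stars_3"></i></h1>'
)

# Hypothetical helper: one page (URL or HTML string) in, one-row data.frame out.
scrape_hotel <- function(src) {
  pg <- read_html(src)

  stars <- pg %>%
    html_nodes("i.b-sprite.stars") %>%
    html_attr("class") %>%
    str_extract("ratings_stars_[[:digit:]]+") %>%
    str_replace("ratings_stars_", "") %>%
    as.numeric()

  name <- pg %>%
    html_nodes("span#hp_hotel_name") %>%
    html_text()

  # Keep only the first match so every page yields exactly one row.
  data.frame(hotel = name[1], stars = stars[1], stringsAsFactors = FALSE)
}

# One data.frame per page, stacked into a single result.
hotels <- bind_rows(lapply(pages, scrape_hotel))
```

Returning a one-row data.frame per page keeps the name and rating aligned even if a page is missing one of them (the missing value simply becomes NA).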

EDIT

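Since you're stuck with the CSS package, you can combine rvest with cssApply in exactly the same way (this assumes an older rvest whose html() returns an XML-package document, which is what cssApply expects):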

pg <- html("http://www.booking.com/hotel/es/starwoodalfonso.es.html#tab-reviews")

pg %>% 
  cssApply("i.b-sprite.stars", cssClass) %>% 
  str_extract("ratings_stars_[[:digit:]]+") %>% 
  str_replace("ratings_stars_", "") %>% 
  as.numeric()

pg %>% cssApply("span#hp_hotel_name", cssCharacter)
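If you would rather stay with htmlParse and nothing but the XML package, the same two values can be pulled out with XPath. This is only a sketch run against the fragment from the question (pasted inline as `html_src`, a name chosen here for illustration); the XPath expressions are assumptions based on that fragment and may need adjusting for the real page.

```r
library(XML)

# The fragment from the question, as an inline string.
html_src <- '<h1 class="IT">
<span class="f" id="hotel">HOTEL</span>
<span class="nowrap">
<i class="b stars ratings_stars_5 star_track" data-track-on-mouseover=""></i>
</span>
</h1>'

doc <- htmlParse(html_src, asText = TRUE)

# Inner text of the span holding the hotel name.
hotel <- xpathSApply(doc, "//span[@id='hotel']", xmlValue)

# The rating is encoded in the class attribute of the <i> element.
cls <- xpathSApply(doc, "//i[contains(@class, 'ratings_stars_')]",
                   xmlGetAttr, "class")
stars <- as.numeric(sub(".*ratings_stars_([0-9]+).*", "\\1", cls))
```

Here `xmlValue` returns the inner text of a node, while `xmlGetAttr` reads an attribute, which is why the rating needs the extra `sub()` step to isolate the digit.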