majesus majesus - 1 year ago 102
HTML Question

htmlParse - inner text

I need to scrape this text from an HTML document using htmlParse (package: XML) in R:

<h1 class="IT">
<span class="f" id="hotel">HOTEL</span>
<span class="nowrap">
<i class="b stars ratings_stars_5 star_track" data-track-on-mouseover=""></i>

I am using this code (code-example) to scrape the name of hotels. However, I need to add the rating of the hotels:

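library(RCurl)  # getURL
library(XML)    # htmlParse
library(CSS)    # cssApply, cssApplyInNodeSet, cssCharacter

for (i in seq_len(3)) {
  txt <- getURL(url = baseURL[i], followlocation = TRUE, encoding = "UTF-8")
  doc <- htmlParse(txt)

  # Both calls target the h3 headings under .details; the second overwrites the first
  hotel <- cssApply(doc, ".details>h3", cssCharacter)
  hotel <- cssApplyInNodeSet(doc, ".details", "h3", cssCharacter)
  data <- cbind(hotel)
}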

Answer

rvest generally makes these operations much easier:


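library(rvest)
library(stringr)

pg <- read_html("")  # read_html() supersedes the older html()

# Star rating: read the class attribute and keep the digits after "ratings_stars_"
pg %>% 
  html_nodes("i.b-sprite.stars") %>% 
  html_attr("class") %>% 
  str_extract("ratings_stars_[[:digit:]]+") %>% 
  str_replace("ratings_stars_", "") %>% 
  as.numeric()

## [1] 5

# Hotel name: text content of the name span
pg %>% 
  html_nodes("span#hp_hotel_name") %>% 
  html_text()

## [1] "Hotel Alfonso XIII"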

It should be straightforward to put the results in a data.frame: wrap the per-page extraction in lapply(), then combine the pieces with dplyr::bind_rows().
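
A minimal sketch of that lapply() + bind_rows() idea, assuming every page exposes the same span#hp_hotel_name and i.b-sprite.stars nodes used above. The helper name scrape_hotel and the inline HTML strings in pages are hypothetical stand-ins so the example runs without network access; with real pages you would pass each URL from baseURL to read_html() instead.

```r
library(rvest)
library(stringr)
library(dplyr)

# Hypothetical helper: extract name and star rating from one parsed page.
scrape_hotel <- function(pg) {
  stars <- pg %>%
    html_nodes("i.b-sprite.stars") %>%
    html_attr("class") %>%
    str_extract("ratings_stars_[[:digit:]]+") %>%
    str_replace("ratings_stars_", "") %>%
    as.numeric()

  name <- pg %>%
    html_nodes("span#hp_hotel_name") %>%
    html_text()

  # Keep the first match of each so every page yields exactly one row
  # (NA if a node is missing).
  data.frame(hotel = name[1], stars = stars[1], stringsAsFactors = FALSE)
}

# Made-up inline pages standing in for the real documents.
pages <- c(
  '<h1><span id="hp_hotel_name">Hotel A</span><i class="b-sprite stars ratings_stars_5"></i></h1>',
  '<h1><span id="hp_hotel_name">Hotel B</span><i class="b-sprite stars ratings_stars_3"></i></h1>'
)

# One single-row data.frame per page, stacked into one result.
hotels <- lapply(pages, function(p) scrape_hotel(read_html(p))) %>%
  bind_rows()
```

Returning a one-row data.frame per page keeps names and ratings aligned even when a page lacks one of the nodes.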


If you're stuck with the CSS package, you can combine rvest with cssApply in exactly the same manner:

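library(CSS)

# cssApply works on XML-package documents, so this relies on an
# older rvest whose html() returns one
pg <- html("")

pg %>% 
  cssApply("i.b-sprite.stars", cssClass) %>% 
  str_extract("ratings_stars_[[:digit:]]+") %>% 
  str_replace("ratings_stars_", "") %>% 
  as.numeric()

pg %>% cssApply("span#hp_hotel_name", cssCharacter)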
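
For completeness, here is a sketch that stays with htmlParse and the XML package alone, as the question asks. It is not part of the approaches above: it swaps the CSS selectors for XPath, and the inline txt string is an assumption modelled on the fragment in the question (with closing tags added) so that it runs offline.

```r
library(XML)

# Inline HTML modelled on the snippet in the question (closing tags added).
txt <- '<h1 class="IT">
  <span class="f" id="hotel">HOTEL</span>
  <span class="nowrap">
    <i class="b stars ratings_stars_5 star_track" data-track-on-mouseover=""></i>
  </span>
</h1>'

doc <- htmlParse(txt, asText = TRUE)

# Hotel name: text content of the span with id "hotel".
hotel <- xpathSApply(doc, "//span[@id='hotel']", xmlValue)

# Rating: read the class attribute of the <i> element, then keep the
# digits that follow "ratings_stars_".
cls <- xpathSApply(doc, "//i[contains(@class, 'ratings_stars_')]",
                   xmlGetAttr, "class")
stars <- as.numeric(sub(".*ratings_stars_([0-9]+).*", "\\1", cls))

data <- data.frame(hotel = hotel, stars = stars, stringsAsFactors = FALSE)
```

The rating is not inner text at all; it lives in the class attribute, which is why the attribute has to be read and parsed rather than the node's text.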