Sorrentum - 2 months ago
R Question

Rvest: getting node text and not its children's text

The method html_text() (from the R package rvest) concatenates the text of a node and all of its children. I would like to extract only the parent node's own text.

For the following example, html_text() gives HELLO GOODBYE.

I want to get just GOODBYE. How can I get it?



<div class="joke">
<div class="div_inside">
<div class="title_inside">
<a class="link" href="sompage.htm">HELLO</a>
</div>
</div>
GOODBYE
</div>




Answer

Try selecting the main div with class "joke" and keeping only its child nodes that are not themselves div elements, using XPath:

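library(rvest)

# Replace 'your_html_script' with the path, URL, or HTML string to parse.
read_html('your_html_script') %>%
    # Select every child node of div.joke that is not itself a <div>.
    html_nodes(xpath = '//div[@class="joke"]/node()[not(self::div)]') %>%
    # trim = TRUE strips the newlines surrounding the remaining text.
    html_text(trim = TRUE)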
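
For reference, here is a self-contained sketch of the same idea that can be pasted straight into an R session. It inlines the HTML from the question as a string and uses the XPath text() node test, which is a slightly different selector from the one above: it matches only text nodes that are direct children of div.joke. The variable names html, page and own_text are just illustrative, and the filtering of empty strings is a precaution in case the parser keeps whitespace-only text nodes.

```r
library(rvest)

# The HTML from the question, inlined as a string (illustrative input).
html <- '<div class="joke">
<div class="div_inside">
<div class="title_inside">
<a class="link" href="sompage.htm">HELLO</a>
</div>
</div>
GOODBYE
</div>'

page <- read_html(html)

# text() matches only text nodes that are direct children of div.joke,
# so the text inside the nested <a> element is never selected.
own_text <- page %>%
  html_nodes(xpath = '//div[@class="joke"]/text()') %>%
  html_text(trim = TRUE)

# Drop empty strings left by whitespace-only text nodes, then join the rest.
own_text <- paste(own_text[nzchar(own_text)], collapse = " ")
own_text
```

With the example markup this should leave only GOODBYE in own_text. If the parent contains comments or inline elements such as span that you also want to keep, the node()[not(self::div)] selector from the answer is the more general choice.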

Thanks!