Ildar Gabdrakhmanov - 3 months ago
R Question

How to extract text only from parent HTML node (excluding child node)?

I have this HTML:

<div class="activityBody postBody thing">
<p>
<a href="/forum/conversation/post/3904-22" rel="post" data-id="3904-22" class="mqPostRef">(22)</a>
where?
</p>
</div>


I am using this code to extract text:

html_nodes(messageNode, xpath=".//p") %>% html_text() %>% paste0(collapse="\n")


And I get this result:

"(22) where?"


But I need only the text of the "p" node itself, excluding any text inside its child nodes. I want to get this text:

"where?"


Is there any way to exclude child nodes while getting the text?

Mac OS 10.11.6 (15G31), RStudio Version 0.99.903, R version 3.3.1 (2016-06-21)

Answer

This will grab only the text nodes that are direct children of each <p> (which means it won't include text from child elements such as <a>, <b> or <span>):

library(xml2)
library(rvest)
library(purrr)

txt <- '<div class="activityBody postBody thing">
    <p>
        <a href="/forum/conversation/post/3904-22" rel="post" data-id="3904-22" class="mqPostRef">(22)</a>
        where?
    </p>
  <p>
    stays 
    <b>disappears</b>
    <a>disappears</a>
    <span>disappears</span>
    stays
  </p>
</div>'

doc <- read_xml(txt)

# For each <p>, keep only its direct text-node children and join them with a space
html_nodes(doc, xpath="//p") %>%
  map_chr(~paste0(html_text(html_nodes(., xpath="./text()"), trim=TRUE), collapse=" "))
## [1] "where?"     "stays stays"

Unfortunately, that's pretty "lossy" (you lose the contents of <b>, <span>, etc.), but this or @Floo0's (also potentially lossy) solution may work well enough for you.

If you use the XML package you can actually edit nodes (i.e. delete node elements).
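
A rough sketch of that node-deletion approach, assuming the XML package is installed and that dropping every element child of <p> is acceptable for your data. The variable names (txt, doc, result) are only for illustration:

```r
library(XML)

txt <- '<div class="activityBody postBody thing">
  <p>
    <a href="/forum/conversation/post/3904-22" rel="post" data-id="3904-22" class="mqPostRef">(22)</a>
    where?
  </p>
</div>'

# Parse the fragment as HTML into an internal (editable) document
doc <- htmlParse(txt, asText = TRUE)

# Delete every element that is a direct child of a <p>;
# the <p> keeps only its own text nodes
removeNodes(getNodeSet(doc, "//p/*"))

# xmlValue() now returns just the remaining text; trim the surrounding whitespace
result <- trimws(xpathSApply(doc, "//p", xmlValue))
result
```

Note that this modifies doc in place, so parse the page again if you still need the removed nodes elsewhere.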