Oliver Frost - 14 days ago
R Question

Removing specific tags with regex while preserving contents

I have a specific problem with a body of text containing HTML tags that can be solved by removing specific tags and keeping the content of those tags (essentially taking the text up one level in the hierarchy).

For example, I would like:

<div>
<div class="meta">Wednesday, 2 November 2016 at 15:52 UTC</div>
<div class="comment">My life this weekend</div>
<p></p>
</div>


To become:

<div>
<div class="meta">Wednesday, 2 November 2016 at 15:52 UTC</div>
My life this weekend
<p></p>
</div>


I am using library(XML) to parse the tags once they are cleaned, so XML::xpathSApply() is not what I need here.

I believe the solution is a regular expression that matches both tags in a single pattern, ignores the text between them, and does a straight replace with " ". I believe lookahead is required too, but I am new to regex and struggling with it a bit.

The <div class="comment"></div> tags themselves are consistent and do not contain random amounts of whitespace.

Thanks!

Answer
text <- "<div>
<div class=\"meta\">Wednesday, 2 November 2016 at 15:52 UTC</div>
<div class=\"comment\">My life this weekend</div>
<p></p>
</div>"

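# Lazily capture the text between the comment tags and keep only the capture (\\1),
# which drops the opening and closing tags. No lookahead is needed.
m <- gsub("<div class=\"comment\">(.*?)</div>", "\\1", text, perl = TRUE)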
cat(m)

<div>
<div class="meta">Wednesday, 2 November 2016 at 15:52 UTC</div>
My life this weekend
<p></p>
</div>
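
One caveat with the pattern above: with perl = TRUE, "." does not match a newline, so a comment whose text spans several lines would be left untouched. A minimal sketch of one way to handle that case, assuming the comment divs never contain nested divs; the helper name unwrap_comment_divs and the multi_line sample are made up for illustration:

```r
# Hypothetical helper (the name is illustrative, not from the original answer):
# unwrap <div class="comment"> blocks, including ones spanning several lines.
unwrap_comment_divs <- function(html) {
  # (?s) lets "." match newlines; ".*?" stays lazy, so each match
  # stops at the first closing </div> after the opening comment tag.
  gsub("(?s)<div class=\"comment\">(.*?)</div>", "\\1", html, perl = TRUE)
}

# Made-up sample input with a comment that contains a line break
multi_line <- "<div>\n<div class=\"comment\">My life\nthis weekend</div>\n<p></p>\n</div>"
cat(unwrap_comment_divs(multi_line))
```

Because the match ends at the first </div>, this would mangle a comment div that itself contains another div; in that situation a real HTML parser is probably the safer route.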