Gabriel Den Hartog - 1 month ago
C# Question

Regex with conditional html tag

I need to write a Regex that captures what's inside a specific HTML tag:

<span class="sentence">CAPTURE HERE</span>


So I wrote, in C#:

<span class=\"sentence\">((.|\\s)*?)</span>


The problem, which I'm not sure how to solve, is that there is another span nested inside that span. It also ends with </span>, so the capture stops at the wrong closing tag. How do I write a condition in a Regex that checks whether there is a nested span whose class is not "sentence" and, if there is, makes the capture end at the next </span>?

The input string for the Regex:

<span class="sentence">O que a história da escravidão tem a dizer sobre <span class="CharOverride-15">experiências religiosas</span>?</span><span class="sentence"> Só silêncios,</span>


What I ideally want to capture:

O que a história da escravidão tem a dizer sobre <span class="CharOverride-15">experiências religiosas</span>? Só silêncios,

Answer

Don't use Regex to parse HTML. Use a real HTML parser such as HtmlAgilityPack:

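// Requires the HtmlAgilityPack NuGet package; htmlstring holds the input HTML.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlstring);
// Selects the first <span class="sentence">; nested spans stay inside it.
var span = doc.DocumentNode.SelectSingleNode("//span[@class='sentence']");
var text = span.InnerText; // text only, nested tags stripped
var html = span.InnerHtml; // markup between the opening and closing tags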
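
SelectSingleNode returns only the first matching span, while the input in the question contains two. A minimal sketch of collecting all of them could look like the following. The class name SentenceExtractor and the method ExtractSentences are hypothetical names chosen for illustration, and the sketch assumes the HtmlAgilityPack NuGet package is referenced and that each span's class attribute is exactly "sentence".

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

static class SentenceExtractor
{
    // Hypothetical helper: concatenates the inner HTML of every
    // <span class="sentence"> found in the given markup.
    public static string ExtractSentences(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var spans = doc.DocumentNode.SelectNodes("//span[@class='sentence']");
        if (spans == null)
            return string.Empty;

        // InnerHtml keeps nested tags, such as the CharOverride-15 span, intact.
        return string.Concat(spans.Select(s => s.InnerHtml));
    }
}
```

Calling SentenceExtractor.ExtractSentences(htmlstring) on the input from the question should give the two sentences joined, with the nested span preserved. Swapping InnerHtml for InnerText would drop the nested tags and keep only the text. If the spans may carry several classes, the exact-match test @class='sentence' would need to be loosened, for example with an XPath contains() check.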