A K A K - 9 months ago 188
C# Question

How to parse text from an anonymous block in AngleSharp?

I'm parsing site content using AngleSharp and I've run into an issue with an anonymous block.

See the sample code:

var parser = new HtmlParser();
var document = parser.Parse(@"<body>
<div class='product'>
<a href='#'><img src='img1.jpg' alt=''></a>
Hello, world
<div class='comments-likes'>1</div>
</div>
<div class='product'>
<a href='#'><img src='img2.jpg' alt=''></a>
Yet another helloworld
<div class='comments-likes'>25</div>
</div>
</body>");

var products = document.QuerySelectorAll("div.product");
foreach (var product in products)
{
    var productTitle = product.Text();
    productTitle.Dump();
}


So productTitle also contains the numbers from div.comments-likes. The output is:


Hello, world 1

Yet another helloworld 25


I've tried something like
product.FirstElementChild.NextElementSibling.Text();
but the next element sibling of the link is div.comments-likes, not the anonymous block. It shows:


1

25


So, anonymous blocks are skipped. :(

The best workaround I've found is to remove the elements that get in the way. For my example:

product.QuerySelector(".comments-likes").Remove();
var productTitle = product.Text().Trim();


Is there a better way to parse text from an anonymous block?

Answer

Text is modeled as a TextNode, which is one type of node alongside elements, comment nodes, processing instructions, etc. That's why the NextElementSibling you tried didn't include the text in the result: as the name suggests, it returns elements only.
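
To illustrate the difference, here is a minimal sketch that walks NextSibling (which visits every node type) instead of NextElementSibling. The class name SiblingText and the method NextTextSibling are hypothetical helpers made up for this example, not part of AngleSharp; only INode.NextSibling, NodeType and TextContent come from the library.

```csharp
using AngleSharp.Dom;

// Hypothetical helper for illustration; not an AngleSharp API.
static class SiblingText
{
    // Returns the trimmed text of the first non-blank text node that
    // follows `element` among its siblings, or null if there is none.
    public static string NextTextSibling(IElement element)
    {
        // NextSibling visits text and comment nodes too;
        // NextElementSibling would jump straight to the next element.
        for (var node = element.NextSibling; node != null; node = node.NextSibling)
        {
            if (node.NodeType == NodeType.Text &&
                !string.IsNullOrWhiteSpace(node.TextContent))
            {
                return node.TextContent.Trim();
            }
        }
        return null;
    }
}
```

With the markup from the question, SiblingText.NextTextSibling(product.FirstElementChild) should pick up the text that follows the link rather than the contents of div.comments-likes.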

You can get the text nodes located directly within the product div by traversing the div's ChildNodes and filtering by NodeType, for example:

var products = document.QuerySelectorAll("div.product");
foreach (var product in products)
{
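    // Requires `using System.Linq;`. Note that First() throws
    // InvalidOperationException if the div has no non-blank text node.
    var productTitle = product.ChildNodes
                              .First(o => o.NodeType == AngleSharp.Dom.NodeType.Text
                                            && o.TextContent.Trim() != "");
    Console.WriteLine(productTitle.TextContent.Trim());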
}

dotnetfiddle demo

Notice that newlines between elements are also text nodes, so we need to filter those out in the demo above.
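
If a div may contain several pieces of direct text, or none at all, a variation on the same idea is to collect every non-blank text node instead of taking only the first. This is just a sketch: DirectText and OwnText are hypothetical names chosen for the example, and joining the pieces with a single space is an assumption about the desired output.

```csharp
using System.Linq;
using AngleSharp.Dom;

// Hypothetical helper for illustration; not an AngleSharp API.
static class DirectText
{
    // Joins the text nodes that are direct children of `element`,
    // skipping whitespace-only nodes such as newlines between tags.
    // Text inside child elements (e.g. div.comments-likes) is ignored.
    public static string OwnText(IElement element)
    {
        var parts = element.ChildNodes
            .Where(n => n.NodeType == NodeType.Text)
            .Select(n => n.TextContent.Trim())
            .Where(t => t.Length > 0);

        // Returns an empty string when there is no direct text,
        // instead of throwing the way First() would.
        return string.Join(" ", parts);
    }
}
```

Inside the loop from the answer this would be called as DirectText.OwnText(product).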