Belterius - 3 months ago
C# Question

Html agility xpath get following node if

I have an html document structured as:

<h3><a name="sect55">55</a></h3>
<p></p>
<p class="choice"><a href="#sect325"></a></p>

<h3><a name="sect56"></a></h3>
<p></p>
<p class="choice"><a href="#sect222"></a></p>

<h3><a name="sect57"></a></h3>
<p></p>
<p class="choice"><a href="#sect164"></a></p>
<p class="choice"><a href="#sect109"></a></p>
<p class="choice"><a href="#sect308"></a></p>


I want to retrieve, in a separate List, all the nodes of each section, that is, all the nodes up to the next <h3>.

For now I'm using:

for (int paragraph = xx; paragraph <= yy; paragraph++)
{
    nameActual = "sect" + paragraph;
    nameNext = "sect" + (paragraph + 1);
    HtmlNodeCollection NodeOfParagraph = doc.DocumentNode.SelectNodes(String.Format("//h3[a[@name='{0}']]/following-sibling::p[following::h3/a[@name='{1}']]", nameActual, nameNext));

    // Multiple actions on my NodeOfParagraph
}


So I select the first <h3> that contains an <a> with the name I'm looking for, and then I select all the following <p> siblings that are themselves followed by an <h3> whose <a> has the next name.

It works, but it takes a really long time, I suppose because for each node it tests all the other nodes for their value.

How can I improve the performance of my query?

Answer

You could do the following:

  1. Find all the section definitions and store them in a list
  2. Loop through the section definitions and, for each one, get all the nodes between this section and the next one (or the end of the document if there are no more section definitions) by specifying the exact name of the next section in the query

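var doc = new HtmlDocument();
doc.Load(@"path\to\file.html");
// Step 1: every <h3> whose <a> has a name starting with "sect"
var sects = doc.DocumentNode.SelectNodes("//h3[a[starts-with(@name, 'sect')]]");

// Step 2: for each section, select the <p> siblings up to the next section
for (var index = 0; index < sects.Count; index++)
{
    var isLast = (index == sects.Count - 1);
    // Start from the context node (the current <h3>) and look at its siblings only
    var xpath = "./following-sibling::p";
    if (!isLast)
    {
        // Keep only the <p> nodes whose nearest following <h3> is the next section
        var nextName = sects[index + 1].SelectSingleNode("./a").Attributes["name"].Value;
        xpath += string.Format("[following-sibling::h3[1][a/@name = '{0}']]", nextName);
    }
    // SelectNodes returns null (not an empty collection) when nothing matches
    var collection = sects[index].SelectNodes(xpath);
}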

This has the following advantages:

  • it does not try to find a section number that doesn't exist
  • it uses the context node (starting the query with ./), so that unnecessary parts of the document are not searched
  • it stops at the next h3 (h3[1]), so the predicate only checks the nearest following section heading
  • it only searches siblings, not descendants (following-sibling:: instead of following::)
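
Since the question asks for the nodes of each section in a separate List, here is a minimal sketch of how the loop body could be filled in. It is not a definitive implementation: the class name SectionDemo, the helper GroupBySection and the shortened inline HTML are illustrative assumptions, and it assumes the Html Agility Pack package is referenced and that every section heading and its paragraphs are siblings, as in the sample document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class SectionDemo
{
    // Shortened copy of the sample from the question (assumption: same flat layout).
    const string Html = @"
<h3><a name=""sect55"">55</a></h3>
<p></p>
<p class=""choice""><a href=""#sect325""></a></p>
<h3><a name=""sect56""></a></h3>
<p></p>
<p class=""choice""><a href=""#sect222""></a></p>";

    // Hypothetical helper: maps each section name to the <p> nodes that follow its <h3>.
    public static Dictionary<string, List<HtmlNode>> GroupBySection(HtmlDocument doc)
    {
        var result = new Dictionary<string, List<HtmlNode>>();
        var sects = doc.DocumentNode.SelectNodes("//h3[a[starts-with(@name, 'sect')]]");
        if (sects == null)
            return result; // SelectNodes returns null when nothing matches

        for (var index = 0; index < sects.Count; index++)
        {
            var name = sects[index].SelectSingleNode("./a[@name]").GetAttributeValue("name", "");
            var xpath = "./following-sibling::p";
            if (index < sects.Count - 1)
            {
                // Restrict to the <p> nodes whose nearest following <h3> is the next section.
                var nextName = sects[index + 1].SelectSingleNode("./a[@name]").GetAttributeValue("name", "");
                xpath += string.Format("[following-sibling::h3[1][a/@name = '{0}']]", nextName);
            }

            var nodes = sects[index].SelectNodes(xpath);
            result[name] = nodes == null ? new List<HtmlNode>() : nodes.ToList();
        }
        return result;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);

        // Print each section name with the number of <p> nodes collected for it.
        foreach (var pair in GroupBySection(doc))
            Console.WriteLine("{0}: {1} <p> node(s)", pair.Key, pair.Value.Count);
    }
}
```

Keying the dictionary by section name means the choices of a given section can be looked up later without running another XPath query, and the null checks cover both a document with no sections and a section with no paragraphs.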