Belterius Belterius - 2 months ago 6x
C# Question

Html agility xpath get following node if

I have an html document structured as:

<h3><a name="sect55">55</a></h3>
<p class="choice"><a href="#sect325"></a></p>

<h3><a name="sect56"></a></h3>
<p class="choice"><a href="#sect222"></a></p>

<h3><a name="sect57"></a></h3>
<p class="choice"><a href="#sect164"></a></p>
<p class="choice"><a href="#sect109"></a></p>
<p class="choice"><a href="#sect308"></a></p>

I want to retrieve, in a separate List per section, all the <p> nodes up to the next section, that is, up to the next <h3>.

For now I'm using:

for (int paragraph = xx; paragraph <= yy; paragraph++)
{
    nameActual = "sect" + paragraph;
    nameNext = "sect" + (paragraph + 1);
    HtmlNodeCollection NodeOfParagraph = doc.DocumentNode.SelectNodes(String.Format("//h3[a[@name='{0}']]/following-sibling::p[following::h3/a[@name='{1}']]", nameActual, nameNext));

    // Multiple actions on NodeOfParagraph
}

So I select the first <h3> that contains an <a> whose name is the value I'm looking for, and then I select all the <p> nodes that are followed by an <h3> containing an <a> whose name is my next value.

It works, but it takes a really long time, I suppose because for each node it tests all the other nodes for their value.

How can I improve the performance of my query?


You could do the following:

  1. Find all the section definitions and store them in a list
  2. Loop through the section definitions
    • and get all the nodes between this section and the next one (or the end of the document if there are no more section definitions) by specifying the exact name of the next section in the query
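var doc = new HtmlDocument();
// Load the HTML before querying, e.g. with doc.Load(...) or doc.LoadHtml(...)
var sects = doc.DocumentNode.SelectNodes("//h3[a[starts-with(@name, 'sect')]]");

for (var index = 0; index < sects.Count; index++)
{
    var isLast = (index == sects.Count - 1);
    // Start from the current <h3> (the context node) and look only at its sibling <p> nodes
    var xpath = "./following-sibling::p";
    if (!isLast)
    {
        // Keep only the <p> nodes whose nearest following <h3> is the next section
        xpath += string.Format("[following-sibling::h3[1][a/@name = '{0}']]", sects[index + 1].SelectSingleNode("./a").Attributes["name"].Value);
    }
    var collection = sects[index].SelectNodes(xpath); // null when nothing matches
}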


This has the advantages of:

  • not trying to find a section number that doesn't exist
  • using the context node (starting the query with ./) so that unnecessary parts of the document are not searched
  • stopping at the next h3 (h3[1]), so that the rest of the document is not searched
  • only searching siblings and not descendants (following-sibling:: instead of following::)
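
For reference, here is a minimal, self-contained sketch of the same approach that can be run against the sample HTML from the question. The class name SectionGrouper, the method name GroupChoices and the decision to return the href values as strings are assumptions made for illustration and are not part of the original code. It also assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package

public static class SectionGrouper
{
    // Hypothetical helper: maps each section name (e.g. "sect57") to the
    // href values of the <p> links that sit between it and the next <h3>.
    public static Dictionary<string, List<string>> GroupChoices(string html)
    {
        var result = new Dictionary<string, List<string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var sects = doc.DocumentNode.SelectNodes("//h3[a[starts-with(@name, 'sect')]]");
        if (sects == null) return result;

        for (var index = 0; index < sects.Count; index++)
        {
            var name = sects[index].SelectSingleNode("./a[@name]").GetAttributeValue("name", "");

            // Query relative to the current <h3>, siblings only.
            var xpath = "./following-sibling::p";
            if (index < sects.Count - 1)
            {
                // Keep only <p> nodes whose nearest following <h3> is the next section.
                var next = sects[index + 1].SelectSingleNode("./a[@name]").GetAttributeValue("name", "");
                xpath += string.Format("[following-sibling::h3[1][a/@name = '{0}']]", next);
            }

            var hrefs = new List<string>();
            var paragraphs = sects[index].SelectNodes(xpath);
            if (paragraphs != null)
            {
                foreach (var p in paragraphs)
                {
                    var link = p.SelectSingleNode("./a");
                    if (link != null) hrefs.Add(link.GetAttributeValue("href", ""));
                }
            }
            result[name] = hrefs;
        }
        return result;
    }
}
```

Calling SectionGrouper.GroupChoices(html) with the document from the question should give one list per section, with the three links of sect57 grouped together. The null checks matter because SelectNodes returns null rather than an empty collection when a section has no paragraphs.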