John S - 21 days ago
C# Question

Finding a specific node using Xpath and Linq

Using HtmlAgilityPack and LINQ on the following HTML string, I am trying to get the "Last Date to file:" date. The XPath has eluded me.

<table>
<tbody>
<tr>
<td><b></b> John E. Clement
</td>
<td>
<b></b>
</td>
<td>
<b>Chapter: </b>1
</td>
</tr>
<tr>
<td>
<b>Office: </b>Littleton
</td>
<td>
<b>&nbsp;&nbsp; &nbsp;&nbsp; </b>
</td>
<td><b>Last Date to file: </b>04/18/2017</td>
</tr>
<tr>
<td><b>Boss: </b>Michael Meyer </td>
<td><b></b></td>
<td><b>Last Date to file again: </b>06/06/2018</td>
</tr>
</tbody>
</table>


My C# code is:

HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("*My file with the html above*");
var lastDate = doc.DocumentNode.Descendants().Where(a => a.InnerText.Contains("Last"));


It seems that there should be a way to get a single node based on its InnerText, but I am getting a collection of all the td tags in the document.

Answer

DocumentNode.Descendants() effectively gets all nodes in the document except the root, text nodes included. The InnerText property of a node includes all text contained inside that node, including the text of its descendant nodes. For example, given the HTML

<div>
    This <span>is some <b>text</b></span>
</div>

the InnerText of the div tag is "This is some text" (plus the surrounding whitespace).

Therefore, the query doc.DocumentNode.Descendants().Where(a => a.InnerText.Contains("Last")); returns not only the b tag that contains "Last", but also the td that contains the b, the tr that contains the td, the tbody and table that contain the tr, and so on up the tree.
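To see this over-matching directly, here is a minimal sketch that lists the name of every node the unfiltered query matches. The class and method names (InnerTextDemo, MatchingNodeNames) are made up for illustration, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class InnerTextDemo
{
    // Returns the Name of every descendant node whose InnerText contains the given text.
    public static List<string> MatchingNodeNames(string html, string text)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants()                           // every node below the document root
            .Where(n => n.InnerText.Contains(text))  // InnerText includes descendant text
            .Select(n => n.Name)                     // element name, or "#text" for text nodes
            .ToList();
    }
}
```

Calling MatchingNodeNames("<table><tr><td><b>Last Date to file: </b>04/18/2017</td></tr></table>", "Last") should list the table, tr, td and b elements as well as the text node inside the b, since each of them has the word in its InnerText.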

Try filtering by element name as well as InnerText, like so: var lastDate = doc.DocumentNode.Descendants().Where(a => a.Name == "td" && a.InnerText.Contains("Last"));

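This returns only two td elements: the cells labelled "Last Date to file:" and "Last Date to file again:". To end up with a single node, take the first match with FirstOrDefault() or match the full label text.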
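If only the date itself is wanted, one possible approach is to match the b label exactly and read the text node that follows it. This is a sketch under assumptions, not the only way to do it: LastDateFinder and ValueAfterLabel are hypothetical names, and the code assumes the value always sits directly after the b tag inside the same td, as in the HTML from the question.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LastDateFinder
{
    // Returns the text that directly follows the given <b> label, or null if the label is missing.
    public static string ValueAfterLabel(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Compare the trimmed label for equality so that
        // "Last Date to file again:" is not matched as well.
        var labelNode = doc.DocumentNode
            .Descendants("b")
            .FirstOrDefault(b => HtmlEntity.DeEntitize(b.InnerText).Trim() == label);

        // Assumption: the value is the node immediately after the <b> in the same cell.
        var valueNode = labelNode?.NextSibling;
        return valueNode == null
            ? null
            : HtmlEntity.DeEntitize(valueNode.InnerText).Trim();
    }
}
```

For the table in the question, ValueAfterLabel(html, "Last Date to file:") would be expected to return the date string from that cell, which can then be passed to DateTime.ParseExact if a DateTime is needed.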
