Andrea S. - 11 months ago
C# Question

Find a specific link in an HTML document in C# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:

public string findCollegeURL(string catalog, string college)
{
    // Find college
    HtmlDocument doc = new HtmlDocument();
    // (the step that loads the page into doc is not shown in this post)
    var root = doc.DocumentNode;
    var htmlNodes = root.DescendantsAndSelf();

    // Search through fetched html nodes for relevant information
    int counter = 0;
    foreach (HtmlNode node in htmlNodes)
    {
        string linkName = node.InnerText;
        if (linkName == colleges[college] && counter == 0)
        {
            counter++; // first match (footer); body missing from the post, presumably an increment
        }
        else if (linkName == colleges[college] && counter == 1)
        {
            string targetURL = node.Attributes["href"].Value; // NullReferenceException is thrown here
            return targetURL;
        }
    }

    return "DID NOT WORK";
}

The program enters the else-if branch, but when it tries to retrieve the link I get a NullReferenceException. Why is that? How can I retrieve the link I need?

Here is the code in the HTML doc that I'm trying to access:

<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<a href="/content.php?catoid=10&navoid=1210" class="navbar" tabindex="119">College of Science</a>

This is the link that I want: /content.php?catoid=10&navoid=1210

Answer

I find XPath easier than writing a lot of code:

var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']");
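
The node returned by that query still has to be turned into a URL. Here is a minimal sketch of one way to do it, assuming the page HTML is already available as a string. The names LinkLookup and FindHrefByText are made up for illustration. SelectSingleNode returns null when nothing matches, and the Attributes indexer returns null when the attribute is absent, so the null-conditional operators keep either case from throwing.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of HTML Agility Pack.
public static class LinkLookup
{
    // Returns the href of the first <a> whose text equals linkText,
    // or null when there is no such link or it has no href attribute.
    public static string FindHrefByText(string html, string linkText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: linkText contains no single quote, because it is
        // placed inside the XPath string literal without escaping.
        var link = doc.DocumentNode.SelectSingleNode($"//a[text()='{linkText}']");

        // link is null when nothing matched; Attributes["href"] is null
        // when the attribute is missing.
        return link?.Attributes["href"]?.Value;
    }
}
```

Calling LinkLookup.FindHrefByText(html, "College of Science") would then give the href of the first matching link in document order, which on the page in the question is the footer link.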

If there are two links with the same text, select the second one with a positional predicate:

var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]");
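
Selecting by position depends on the footer always coming first. Since the markup in the question shows the navigation cell as td id="acalog-navigation", another option is to scope the query to that cell. This is only a sketch: it assumes the wanted link sits somewhere inside that cell and the footer link does not, and the names NavLinkLookup and FindNavHref are hypothetical.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of HTML Agility Pack.
public static class NavLinkLookup
{
    // Returns the href of the first matching <a> inside the
    // td#acalog-navigation cell, or null when nothing matches.
    public static string FindNavHref(string html, string linkText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: linkText contains no single quote (it is not escaped).
        // "//a" after the td step matches links at any depth inside the cell.
        var xpath = $"//td[@id='acalog-navigation']//a[text()='{linkText}']";
        var link = doc.DocumentNode.SelectSingleNode(xpath);

        return link?.Attributes["href"]?.Value;
    }
}
```

With this approach the result should no longer depend on whether the footer appears before or after the navigation bar.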

The LINQ version:

var links = doc.DocumentNode.Descendants("a")
               .Where(a => a.InnerText == "College of Science")
               .Select(a => a.Attributes["href"].Value);
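
The LINQ query yields the href of every matching link, so one more step is needed to pick the second. Restricting the search to a elements also points at a likely cause of the exception in the question: DescendantsAndSelf() visits text nodes as well as elements, and the text node inside a link has the same InnerText as the link but no attributes, so Attributes["href"] is null for it. The sketch below is one possible variant. LinqLinkLookup and FindNthHref are illustrative names, and the page HTML is assumed to be in a string.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HTML Agility Pack.
public static class LinqLinkLookup
{
    // Returns the href of the n-th (1-based) <a> whose text equals linkText.
    // Returns null when there are fewer than n matches, or when the n-th
    // match has no href attribute.
    public static string FindNthHref(string html, string linkText, int n)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")                          // only elements named "a", no text nodes
            .Where(a => a.InnerText == linkText)
            .Select(a => a.Attributes["href"]?.Value)  // null instead of an exception
            .Skip(n - 1)
            .FirstOrDefault();
    }
}
```

For the page in the question, FindNthHref(html, "College of Science", 2) would correspond to the second XPath query above.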