Andrea S. - 18 days ago
C# Question

Find a specific link in an HTML document in C# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:

public string findCollegeURL(string catalog, string college)
{
    // Find college
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(catalog);
    var root = doc.DocumentNode;
    var htmlNodes = root.DescendantsAndSelf();

    // Search through fetched html nodes for relevant information
    int counter = 0;
    foreach (HtmlNode node in htmlNodes) {
        string linkName = node.InnerText;
        if (linkName == colleges[college] && counter == 0)
        {
            counter++;
            continue;
        }
        else if (linkName == colleges[college] && counter == 1)
        {
            string targetURL = node.Attributes["href"].Value;
            return targetURL;
        }
    }

    return "DID NOT WORK";
}


The program reaches the else-if branch, but when it tries to read the link I get a NullReferenceException. Why is that? How can I retrieve the link I need?

Here is the code in the HTML doc that I'm trying to access:

<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<a href="/content.php?catoid=10&navoid=1210" class="navbar" tabindex="119">College of Science</a>
</div>


This is the link that I want: /content.php?catoid=10&navoid=1210

L.B
Answer

I find XPath easier than writing a lot of code:

var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
              .Attributes["href"].Value;

If there are two links with the same text, select the second one:

var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
              .Attributes["href"].Value;
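
Relying on the position of the match can break if the page gains or loses a link with the same text. Since the navigation bar in the question sits inside a cell with id="acalog-navigation", the XPath can be scoped to that cell instead. The following is only a sketch under that assumption; the class and method names (CollegeLinkFinder, FindNavLink) are made up for illustration, and it assumes the link text contains no single quote, because the text is embedded in the XPath literal.

```csharp
using HtmlAgilityPack;

public static class CollegeLinkFinder
{
    // Returns the href of the navigation-bar link whose text equals linkText,
    // or null when no such link exists.
    public static string FindNavLink(string html, string linkText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Search only inside the navigation cell, so the footer link is never considered.
        // Assumption: linkText contains no single quote.
        var anchor = doc.DocumentNode.SelectSingleNode(
            "//td[@id='acalog-navigation']//a[normalize-space(text())='" + linkText + "']");

        // SelectSingleNode returns null when nothing matches, and GetAttributeValue
        // returns the fallback when the attribute is missing, so nothing here throws.
        return anchor?.GetAttributeValue("href", string.Empty);
    }
}
```

Calling FindNavLink(catalog, "College of Science") should give the navigation link regardless of where the footer appears in the document.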

The LINQ version:

var links = doc.DocumentNode.Descendants("a")
               .Where(a => a.InnerText == "College of Science")
               .Select(a => a.Attributes["href"].Value)
               .ToList();
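
As for why the loop in the question throws: DescendantsAndSelf() enumerates every node, including #text nodes and wrapper elements, and several of them can share the same InnerText. The second match is therefore most likely not an <a> element at all (for example, the text node inside the footer link), so Attributes["href"] returns null and reading .Value fails. A minimal sketch of a null-safe variant follows; the names AnchorSearch and FindHrefs are hypothetical, and it assumes the link text may carry surrounding whitespace or HTML entities.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class AnchorSearch
{
    // Collects the href values of all <a> elements whose text equals linkText,
    // in document order. Anchors without an href are skipped.
    public static List<string> FindHrefs(string html, string linkText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("a")   // only <a> elements, never text nodes
            .Where(a => HtmlEntity.DeEntitize(a.InnerText).Trim() == linkText)
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0)
            .ToList();
    }
}
```

With the footer first in the document, FindHrefs(catalog, "College of Science").ElementAtOrDefault(1) would be the navigation link, and it is null rather than an exception when there is no second match.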