SearchForKnowledge - 1 month ago
C# Question

How to use HTMLAgilityPack to extract HTML data

I am learning to write a web crawler and found some great examples to get me started, but since I am new to this, I have a few questions about the coding method.

An example search result can be found here: Search Result

When I look at the HTML source for the result I can see the following:

<HR><CENTER><H3>License Information *</H3></CENTER><HR>
<P>
<CENTER> 06/03/2014 </CENTER> <BR>
<B>Name : </B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK NY <BR>
<B>Profession : </B> ATHLETIC TRAINER <BR>
<B>License No: </B> 001475 <BR>
<B>Date of Licensure : </B> 01/12/07 <BR>
<B>Additional Qualification : </B> &nbsp; Not applicable in this profession <BR>
<B> <A href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> REGISTERED <BR>
<B>Registered through last day of : </B> 08/15 <BR>


How can I use HtmlAgilityPack to scrape this data from the site?

I was trying to adapt the example shown below, but I am not sure what to change to get it to crawl the page:

private void btnCrawl_Click(object sender, EventArgs e)
{
    foreach (SHDocVw.InternetExplorer ie in shellWindows)
    {
        filename = Path.GetFileNameWithoutExtension( ie.FullName ).ToLower();

        if ( filename.Equals( "iexplore" ) )
            txtURL.Text = "Now Crawling: " + ie.LocationURL.ToString();
    }
    string url = ie.LocationURL.ToString();
    string xmlns = "{http://www.w3.org/1999/xhtml}";
    Crawler cl = new Crawler(url);
    XDocument xdoc = cl.GetXDocument();
    var res = from item in xdoc.Descendants(xmlns + "div")
              where item.Attribute("class") != null && item.Attribute("class").Value == "folder-news"
              && item.Element(xmlns + "a") != null
              //select item;
              select new
              {
                  Link = item.Element(xmlns + "a").Attribute("href").Value,
                  Image = item.Element(xmlns + "a").Element(xmlns + "img").Attribute("src").Value,
                  Title = item.Elements(xmlns + "p").ElementAt(0).Element(xmlns + "a").Value,
                  Desc = item.Elements(xmlns + "p").ElementAt(1).Value
              };
    foreach (var node in res)
    {
        MessageBox.Show(node.ToString());
        tb.Text = node + "\n";
    }
    //Console.ReadKey();
}


The Crawler helper class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;

namespace CrawlerWeb
{
    public class Crawler
    {

        public string Url
        {
            get;
            set;
        }
        public Crawler() { }
        public Crawler(string Url)
        {
            this.Url = Url;
        }
        public XDocument GetXDocument()
        {
            HtmlAgilityPack.HtmlWeb doc1 = new HtmlAgilityPack.HtmlWeb();
            doc1.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
            HtmlAgilityPack.HtmlDocument doc2 = doc1.Load(Url);
            doc2.OptionOutputAsXml = true;
            doc2.OptionAutoCloseOnEnd = true;
            doc2.OptionDefaultStreamEncoding = System.Text.Encoding.UTF8;
            XDocument xdoc = XDocument.Parse(doc2.DocumentNode.SelectSingleNode("html").OuterHtml);
            return xdoc;
        }
    }
}


tb is a multiline textbox, so I would like it to display the following:

Name
WILLIAMS AJAYA L


Address
NEW YORK NY


Profession
ATHLETIC TRAINER


License No
001475


Date of Licensure
01/12/07


Additional Qualification
Not applicable in this profession


Status
REGISTERED


Registered through last day of
08/15


I would like the second item of each pair (the value) to be added to an array, because the next step is to write the data to a SQL database.

I am able to get the URL of the search result from IE, but how do I use it in my code?

Answer

This little snippet should get you started:

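// Requires: using System.Net; using HtmlAgilityPack;
// In a WinForms project, write HtmlAgilityPack.HtmlDocument to avoid a clash
// with System.Windows.Forms.HtmlDocument.
HtmlDocument doc = new HtmlDocument();
WebClient client = new WebClient();
// Download the raw HTML of the search result page.
string html = client.DownloadString("http://www.nysed.gov/coms/op001/opsc2a?profcd=67&plicno=001475&namechk=WIL");
doc.LoadHtml(html);

// Select every div element; SelectNodes returns null when nothing matches.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");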

The WebClient class downloads the HTML, which you then load into an HtmlDocument object. From there, you query the DOM tree with XPath to find the nodes you want. In the example above, "nodes" will contain all the div elements in the document.
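The markup quoted in the question contains no div elements, so for that page the XPath would more likely target the B labels. Below is a rough sketch of that idea, not a definitive implementation. It assumes every label is a B element whose value is the text node immediately after it, as in the fragment shown in the question. The class name LicenseParser and method name ParseFields are made up for illustration. HtmlAgilityPack lower-cases element names, so the XPath uses "//b".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LicenseParser
{
    // Returns (label, value) pairs in document order.
    public static List<KeyValuePair<string, string>> ParseFields(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection labels = doc.DocumentNode.SelectNodes("//b");
        if (labels == null)
            return fields;

        foreach (HtmlNode label in labels)
        {
            // "Name : " becomes "Name". InnerText also covers the <A> nested
            // inside the Status label.
            string name = HtmlEntity.DeEntitize(label.InnerText)
                .Trim().TrimEnd(':').Trim();

            // Assumption: the value is the text node right after the <B> element.
            HtmlNode next = label.NextSibling;
            string value = (next != null && next.NodeType == HtmlNodeType.Text)
                ? HtmlEntity.DeEntitize(next.InnerText).Trim() // also strips &nbsp;
                : string.Empty;

            fields.Add(new KeyValuePair<string, string>(name, value));
        }

        return fields;
    }
}
```

With the pairs in hand, the values alone can be pulled into an array with fields.Select(f => f.Value).ToArray(), and the textbox text can be built by joining each Key and Value with Environment.NewLine.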

Here's a quick reference about the XPath syntax: http://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx
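To make the XPath syntax a little more concrete, here is a small illustrative sketch of three expressions run against the markup from the question. The class XPathExamples and its First helper are hypothetical names introduced only for this example, and the expressions assume the page structure shown above.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper class for illustration only.
public static class XPathExamples
{
    // Loads an HTML string and returns the first node matching the XPath,
    // or null when nothing matches.
    public static HtmlNode First(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode(xpath);
    }

    public static void Demo(string html)
    {
        // //h3 : any <h3> anywhere in the document.
        HtmlNode heading = First(html, "//h3");

        // //b/a[@href] : an <a> with an href attribute that is a direct child of a <b>.
        HtmlNode statusLink = First(html, "//b/a[@href]");

        // //b[contains(., 'Profession')] : a <b> whose text contains "Profession".
        HtmlNode profession = First(html, "//b[contains(., 'Profession')]");

        Console.WriteLine(heading?.InnerText.Trim());
        Console.WriteLine(statusLink?.GetAttributeValue("href", string.Empty));
        // The value sits in the text node that follows the label.
        Console.WriteLine(profession?.NextSibling?.InnerText.Trim());
    }
}
```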