Volkan Volkan - 1 month ago 11
ASP.NET (C#) Question

Get an element from a large html file

I have a large HTML file (80 MB) that looks like this:

<html>
<head>...</head>
<body>
<div class="nothing">...</div>
<div class="content">
<h1>Hello</h1>
<div>
<div class="phone"> ... </div>
<div class="phone"> ... </div>
<div class="phone"> ... </div>
</div>
<div>
<div class="phone">
...
<div>
...
</div>
...
</div>
<div class="phone"> ... </div>
</div>
<div>
<div class="phone"> ... </div>
<div class="phone"> ... </div>
<div class="phone"> ... </div>
<div class="phone"> ... </div>
</div>
</div>
</body>
</html>


I can't modify this HTML file manually, so ideally it should stay read-only.

I would like to store each
<div class="phone"> ... </div>
element in an array of strings so that I can manipulate it later. Inside that div there can be other elements of any kind.


  1. I tried to load the file with HtmlDocument and XmlDocument, but the file is so big that I get an out-of-memory exception.

  2. I tried to use Regex to collect all those elements in an array, but I couldn't get it to work.



The regular expression that I used is:

Regex.Matches(myHtml, "<div class=\"phone\">[\\p{L}\\s]*\\,*[\\p{L}\\s]*<div");


This regex is meant to match every

<div class="phone"> ANY UTF8 char </div>


The problem is that the regex consumes characters until it finds the next
</div>
and that closing tag is not necessarily the one that matches the first opening div.

Any ideas on how I can do this? Could the file be split into several strings so that each one can be loaded into an HtmlDocument?

Thanks.

Answer

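You can use the XmlReader class to read the file. XmlReader does not load the whole file into memory; it lets you move through the XML document node by node, parsing it on the fly. Note that this only works if the HTML file is also well-formed XML; otherwise XmlReader throws an exception at the first malformed tag.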

Example of how to read the content of all divs with class = phone:

using (XmlReader reader = XmlReader.Create(@"C:\A.html"))
{
     bool isNextNode = true;

     // Loop over all xml tags 
     while (isNextNode)
     {
          // Check whether we have a div with attribute class = phone
          if(reader.Name == "div" && reader.GetAttribute("class") == "phone")
          {
               // Yes, so read until the corresponding closing tag and output content
               textBox1.AppendText(reader.ReadInnerXml() + Environment.NewLine);
          }
          else
          {
               // No, read to next tag
               isNextNode = reader.Read();
          }
     }
}
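
Since the question asks for the results in an array of strings rather than in a text box, the same loop can be wrapped in a small helper that collects them. This is only a sketch: the class name PhoneDivExtractor and the method ReadPhoneDivs are made up for illustration, and the DtdProcessing.Ignore setting assumes the file may start with a DOCTYPE declaration, which XmlReader rejects by default. It still requires the markup to be well-formed XML.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PhoneDivExtractor
{
    // Hypothetical helper: returns the inner markup of every <div class="phone">.
    public static List<string> ReadPhoneDivs(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            // Assumption: skip a leading DOCTYPE instead of throwing on it.
            DtdProcessing = DtdProcessing.Ignore
        };

        var phones = new List<string>();
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            bool hasNode = reader.Read();
            while (hasNode)
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.Name == "div"
                    && reader.GetAttribute("class") == "phone")
                {
                    // ReadInnerXml consumes the whole element, nested divs
                    // included, and leaves the reader on the node that
                    // follows the matching end tag, so no Read() here.
                    phones.Add(reader.ReadInnerXml());
                    hasNode = !reader.EOF;
                }
                else
                {
                    // Not a phone div: advance to the next node.
                    hasNode = reader.Read();
                }
            }
        }
        return phones;
    }
}
```

Called as PhoneDivExtractor.ReadPhoneDivs(new StreamReader(@"C:\A.html")).ToArray(), this should give a string[] with one entry per phone div. The file is still streamed, so only the extracted strings are held in memory, not the whole document tree.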

For more details refer to the documentation.
