cocre8or - 13 days ago
C# Question

Is there a fast way to parse through a large file with regex?

Problem:
I have a very large file that I need to parse line by line, extracting 3 values from each line. Everything works, but it takes a long time to parse the whole file. Is it possible to do this within seconds? It typically takes between 1 and 2 minutes.

Example file size: 148,208 KB.

I am using regex to parse through every line:

Here is my C# code:

private static void ReadTheLines(int max, Responder rp, string inputFile)
{
    List<int> rate = new List<int>();
    double counter = 1;
    try
    {
        using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 1024))
        {
            string line;
            Console.WriteLine("Reading....");
            while ((line = sr.ReadLine()) != null)
            {
                if (counter <= max)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
                else if (max == 0)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
            }
            rp.GetRate(rate);
            Console.ReadLine();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}


Here is my regex:

public List<int> GetRateLine(string justALine)
{
    const string reg = @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$";
    Match match = Regex.Match(justALine, reg,
        RegexOptions.IgnoreCase);

    // Here we check the Match instance.
    if (match.Success)
    {
        // Finally, we get the Group value and display it.

        string theRate = match.Groups[3].Value;
        Ratestorage.Add(Convert.ToInt32(theRate));
    }
    else
    {
        Ratestorage.Add(0);
    }
    return Ratestorage;
}


Here is an example line to parse; a file usually contains around 200,000 lines:


10.10.10.10 - - [27/Nov/2002:16:46:20 -0500] "GET /solr/ HTTP/1.1" 200 4926 789

sll
Answer

Use Memory Mapped Files (MMF) and the Task Parallel Library (TPL).

  1. Create a persisted MMF with multiple random access views. Each view corresponds to a particular part of the file.
  2. Define a parsing method that takes a parameter such as IEnumerable<string>, essentially to abstract a set of not-yet-parsed lines.
  3. Create and start one TPL task per MMF view, with Parse(IEnumerable<string>) as the task action.
  4. Each worker task adds its parsed data to a shared queue of type BlockingCollection.
  5. Another task listens to the BlockingCollection (via GetConsumingEnumerable()) and processes all data already parsed by the worker tasks.
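
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tuned implementation: the names ParallelLogParser, ParseFile, FindBoundaries and ReadLines are hypothetical, the queue capacity is arbitrary, the file is assumed to be non-empty UTF-8 with '\n' line endings, and the regex is the one from the question (held in a static compiled instance, which is my own choice). View boundaries are moved forward to the next newline so that no line is split between two views.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class ParallelLogParser
{
    // Pattern from the question; built once instead of once per line (assumption).
    static readonly Regex LineRegex = new Regex(
        @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Step 2: parse a set of raw lines; 0 means "no match", as in the question.
    static IEnumerable<int> Parse(IEnumerable<string> lines)
    {
        foreach (string line in lines)
        {
            Match m = LineRegex.Match(line);
            yield return m.Success ? Convert.ToInt32(m.Groups[3].Value) : 0;
        }
    }

    // Reads the lines of one view (one part of the file).
    static IEnumerable<string> ReadLines(MemoryMappedFile mmf, long offset, long size)
    {
        using (var stream = mmf.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }

    // Splits [0, length) into up to 'parts' ranges that each end right after a '\n'.
    static long[] FindBoundaries(MemoryMappedFile mmf, long length, int parts)
    {
        var bounds = new long[parts + 1];
        bounds[parts] = length;
        using (var acc = mmf.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read))
        {
            for (int i = 1; i < parts; i++)
            {
                long pos = Math.Max(bounds[i - 1], length * i / parts);
                while (pos < length && acc.ReadByte(pos) != (byte)'\n') pos++;
                bounds[i] = Math.Min(pos + 1, length);
            }
        }
        return bounds;
    }

    public static List<int> ParseFile(string path, int parts)
    {
        long length = new FileInfo(path).Length;
        var queue = new BlockingCollection<int>(10000); // shared bounded queue (step 4)
        var results = new List<int>();

        // Step 1: persisted (file-backed) MMF, opened read-only.
        using (var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            long[] bounds = FindBoundaries(mmf, length, parts);

            // Step 5: a single consumer drains the queue as workers fill it.
            Task consumer = Task.Factory.StartNew(() =>
            {
                foreach (int rate in queue.GetConsumingEnumerable())
                    results.Add(rate);
            });

            // Steps 3 and 4: one worker task per view.
            var workers = new List<Task>();
            for (int i = 0; i < parts; i++)
            {
                long offset = bounds[i];
                long size = bounds[i + 1] - bounds[i];
                if (size == 0) continue; // a size of 0 would map to the end of the file
                workers.Add(Task.Factory.StartNew(() =>
                {
                    foreach (int rate in Parse(ReadLines(mmf, offset, size)))
                        queue.Add(rate);
                }));
            }

            try { Task.WaitAll(workers.ToArray()); }
            finally { queue.CompleteAdding(); } // always release the consumer
            consumer.Wait();
        }
        return results;
    }
}
```

A call such as ParallelLogParser.ParseFile(inputFile, Environment.ProcessorCount) would return one value per line. Because the views are parsed concurrently, the values do not come back in file order; if order matters, each worker would need to tag its results with the view index. Queueing one item per line also carries overhead, so batching per view may be worth measuring.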

See the Pipelines pattern on MSDN.

Note that this solution requires .NET Framework 4 or later.