Fraiser - 1 month ago
C# Question

How to locate and store character positions in a text file

I am trying to create a lexicographically sorted index of words along with their position in a text file.

With the help of experts on this forum I was able to create the lexicographically sorted index of words. I now need help storing the position in the file of each entry in that index.

This is what I have so far.
A text file (sometextfile.txt) contains the following data: "This is a sample text file"

private const string filepath = @"d:\sometextfile.txt";
using (StreamReader sr = File.OpenText(filepath))
{
    string input;
    // Dictionary to store the position of the characters in the file (long)
    // and the lexicographically sorted value (string)
    var parts = new Dictionary<long,string>();

    while ((input = sr.ReadLine()) != null)
    {
        string[] words = input.Split(' ');
        foreach (var word in words)
        {
            var sortedSubstrings =
                Enumerable.Range(0, word.Length)
                    .Select(i => word.Substring(i))
                    .OrderBy(s => s);
            parts.AddRange(<store the position of the character>, sortedSubstrings);
        }
    }
}

Answer

Using ReadLine loses critical information about your position in the file if you intend the position to be a byte offset that you can seek to. The end of a line may be marked by a carriage return (\r), a line feed (\n), or both, and ReadLine strips the terminator, so you cannot tell how many bytes it occupied. Depending on the encoding of the text file, characters may also be represented by varying numbers of bytes, which you may need to handle as well. I suggest reading the file at a lower level, one character at a time, so you can track your position.
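To make the point about terminators and encodings concrete, here is a minimal sketch. The class and method names (ByteCountDemo, OffsetAfter) are made up for illustration, and it assumes the file has no byte-order mark. It only wraps Encoding.GetByteCount to show how the byte offset after a piece of text changes with the line ending and the encoding.

```csharp
using System;
using System.Text;

static class ByteCountDemo
{
    // Byte offset of whatever follows 'prefix' when the text is stored
    // in 'encoding'. Assumption: no byte-order mark at the start of the file.
    public static long OffsetAfter(string prefix, Encoding encoding)
    {
        return encoding.GetByteCount(prefix);
    }
}
```

For example, OffsetAfter("This\n", Encoding.UTF8) is 5, but OffsetAfter("This\r\n", Encoding.UTF8) is 6, and OffsetAfter("This\n", Encoding.Unicode) is 10 because UTF-16 uses two bytes for each of these characters. Replacing the 'i' with 'ï' (U+00EF) makes the UTF-8 result 6, since that character takes two bytes.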

var parts = new Dictionary<long,string>();
using (System.IO.StreamReader sr = new System.IO.StreamReader(filepath)) // filepath: the constant from the question
{
   var sb = new System.Text.StringBuilder();
   long currentPosition = 0;
   long wordPosition = 0;
   bool wordStarted = false;
   int nextCharNum = sr.Read();
   while (nextCharNum >= 0)
   {
      char nextChar = (char)nextCharNum;
      switch(nextChar)
      {
         case ' ':
         case '\r':
         case '\n':
            if (wordStarted)
            {
               parts[wordPosition] = sb.ToString();
               sb.Clear();
               wordStarted = false;
            }
            break;
         default:
            sb.Append(nextChar);
            if (!wordStarted)
            {
               wordPosition = currentPosition;
               wordStarted = true;
            }
            break;
      }
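      // Advance by the encoded size of this char. Caveats: StreamReader skips a
      // byte-order mark, so offsets exclude it, and a surrogate pair measured
      // one half at a time may be miscounted.
      currentPosition += sr.CurrentEncoding.GetByteCount(nextChar.ToString());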
      nextCharNum = sr.Read();
   }
   if (wordStarted)
      parts[wordPosition] = sb.ToString();
}
foreach (var de in parts)
{
   Console.WriteLine("{0} {1}", de.Key, de.Value);
}
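For the sample line from the question, the loop above should record the words at byte offsets 0, 5, 8, 10, 17 and 22. If you also want a position for each sorted substring, as in the question's sortedSubstrings, one possible approach is sketched below. SuffixIndex and Build are hypothetical names, not part of any library, and the sketch sorts with an ordinal comparison, which may order some strings differently from the culture-sensitive default that OrderBy(s => s) uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class SuffixIndex
{
    // Pairs every suffix of 'word' with its byte offset in the file and
    // returns the pairs sorted lexicographically by suffix (ordinal order).
    // 'wordPosition' is the byte offset of the word's first character.
    public static List<KeyValuePair<long, string>> Build(
        string word, long wordPosition, Encoding encoding)
    {
        var entries = new List<KeyValuePair<long, string>>();
        for (int i = 0; i < word.Length; i++)
        {
            // Never start a suffix in the middle of a surrogate pair.
            if (char.IsLowSurrogate(word[i])) continue;

            // The bytes used by the characters before index i give the offset.
            long offset = wordPosition + encoding.GetByteCount(word.Substring(0, i));
            entries.Add(new KeyValuePair<long, string>(offset, word.Substring(i)));
        }
        return entries.OrderBy(e => e.Value, StringComparer.Ordinal).ToList();
    }
}
```

Calling SuffixIndex.Build("sample", 10, Encoding.UTF8) gives six entries, starting with (11, "ample") and ending with (10, "sample"). A list of pairs is used instead of a Dictionary<long,string> so that the sorted order is kept explicitly.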