Michael Michael - 1 month ago 13
C++ Question

Using seekg() in text mode

While trying to read a simple ANSI-encoded text file in text mode (Windows), I came across some strange behaviour with seekg() and tellg(). Any time I called tellg(), saved its value (as pos_type), and then sought back to it later, I would always wind up further ahead in the stream than where I left off.

Eventually I did a sanity check; even if I just do this...

#include <fstream>
#include <string>

int main()
{
    std::ifstream dataFile("myfile.txt", std::ifstream::in);
    if (dataFile.is_open() && !dataFile.fail())
    {
        while (dataFile.good())
        {
            std::string line;
            // Seeking to the current position should be a no-op.
            dataFile.seekg(dataFile.tellg());
            std::getline(dataFile, line);
        }
    }
}


...then eventually, further into the file, lines are half cut-off. Why exactly is this happening?

Answer

This issue is caused by libstdc++ computing the current offset as the file position reported by lseek64 minus the number of bytes still unread in its buffer.

The buffer size is taken from the return value of read, which for a text-mode file on Windows is the number of bytes placed in the buffer after end-of-line conversion (i.e. each 2-byte \r\n line ending is converted to \n; Windows also seems to append a spurious newline to the end of the file).

lseek64, however (which with mingw results in a call to _lseeki64), returns the current absolute file position in raw bytes. Once the two values are subtracted, you end up with an offset that is off by 1 for each remaining newline in the text file (plus 1 for the extra newline).

The following code should demonstrate the issue. You can even use a file with a single character and no newlines, because of the extra newline inserted by Windows.

#include <iostream>
#include <fstream>

int main()
{
  std::ifstream f("myfile.txt");

  for (char c; f.get(c);)
    std::cout << f.tellg() << ' ';
}

For a file containing a single character, a, I get the following output:

2 3

The first call to tellg is clearly off by 1: after reading one character it should report 1. The second value is correct, because by then the end of the file has been reached and the extra newline has been accounted for.

Aside from opening the file in binary mode, you can work around the issue by disabling buffering:

#include <iostream>
#include <fstream>

int main()
{
  std::ifstream f;
  f.rdbuf()->pubsetbuf(nullptr, 0);
  f.open("myfile.txt");

  for (char c; f.get(c);)
    std::cout << f.tellg() << ' ';
}

but this is far from ideal.

Hopefully mingw / mingw-w64 or gcc can fix this, but first we'll need to determine who is responsible for fixing it. I suppose the underlying issue is with Microsoft's implementation of lseek, which should return values appropriate to the mode in which the file was opened.
