mkk mkk - 1 month ago 16
C++ Question

seekg and imbue in wifstream work wrong

I have a file like the one below:

$ xxd 1line
0000000: 3939 ba2f 6f20 6f66 0d0a  99./o of..


I would like to read this one line in C++:

#include <codecvt>
#include <iostream>
#include <locale>
#include <fstream>
#include <string>

int main(int argc, char** argv) {
    std::wifstream wss(argv[1], std::ios::binary);
    wss.seekg(std::ios_base::end);
    const auto fileSize = wss.tellg();
    wss.seekg(std::ios_base::beg);

    // std::locale utf8_locale(wss.getloc(), new std::codecvt_utf8<wchar_t, 0x10FFFF, std::consume_header>);
    // wss.imbue(utf8_locale);

    std::wstring wline;
    std::getline(wss, wline);

    std::cout << "filelen: " << fileSize << std::endl;
    std::cout << "strlen: " << wline.size() << std::endl;
    std::wcout << "str: " << wline << std::endl;

    return 0;
}


I compile it as follows:

$ g++ -std=c++11 imbue_issue.cpp


First: it seems that wss.seekg(std::ios_base::end) does not move the file position to the end of the file. The file is 10 bytes long, but the reported length is 2:

$ ./a.out 1line
filelen: 2
strlen: 9
str: 99?/o of


Second: when I uncomment the locale-related lines, getline reads only 2 characters:

$ ./a.out 1line
filelen: 2
strlen: 2
str: 99


My compiler:

$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/c++/4.2.1
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin


Does anyone have an idea why these issues occur with this file?

Answer

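The problem is how you call the seekg function. The one-argument overload treats its argument as an absolute position from the beginning of the stream, so you seek to whatever value std::ios_base::end has, which happens to be 2 in your case. That is why tellg reports 2. The later call wss.seekg(std::ios_base::beg) has the same flaw; it only appears to work if beg happens to have the value 0.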

Instead you should use the two-argument overload:

wss.seekg(0, std::ios_base::end);  // Seek to offset 0 from the end
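
As a fuller illustration of the two-argument overload, here is a minimal sketch that measures a file's size in bytes. The helper name file_size is hypothetical, and the sketch uses a narrow std::ifstream in binary mode, on the assumption that a byte count is what is wanted.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not from the question): returns the size of the
// file in bytes, or -1 if the file cannot be opened.
long long file_size(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return -1;
    }
    // Two-argument overload: offset 0 relative to the end of the stream.
    in.seekg(0, std::ios_base::end);
    const std::streamoff size = in.tellg();
    // Rewind the same way: offset 0 relative to the beginning.
    in.seekg(0, std::ios_base::beg);
    return static_cast<long long>(size);
}
```

For the file shown in the question, file_size("1line") should return 10, matching the ten bytes in the xxd dump.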

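You will still have problems reading the file using wide-character types, since the contents don't seem to be wide characters. UTF-8 is a multi-byte narrow character encoding. Note also that the byte 0xba in your file is not valid UTF-8 on its own (it is a continuation byte with no lead byte before it), so the file is probably in a single-byte encoding such as ISO-8859-1 rather than UTF-8. That would explain why, with codecvt_utf8 imbued, the conversion fails at the third byte and getline returns only the first 2 characters.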
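
One way to sidestep the wide-stream conversion is to read the line as raw bytes and convert it explicitly afterwards. The sketch below is only an illustration: the helper names read_first_line and latin1_to_wide are hypothetical, and the conversion assumes the file is ISO-8859-1, which the lone 0xba byte suggests but the question does not confirm.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reads the first line of the file as raw bytes.
// A trailing '\r' (left over from a CRLF line ending) is stripped.
std::string read_first_line(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string line;
    std::getline(in, line);
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();
    }
    return line;
}

// Hypothetical helper: widens bytes to wchar_t ASSUMING ISO-8859-1,
// where every byte value equals the code point of its character.
std::wstring latin1_to_wide(const std::string& bytes) {
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char c : bytes) {
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}
```

With the file from the question, read_first_line yields 8 bytes, and the third wide character after conversion has the value 0xBA. If the data were really UTF-8, a proper decoder would be needed instead of this byte-for-byte widening.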
