Katianie Katianie - 1 month ago 16
C Question

Unable to read UNICODE text file in C

(I looked at previous posts and tried what they suggested but to no avail.)

I'm attempting to read in a file containing only Japanese characters. Here is what that file looks like:

わたし わ エドワド オ’ハゲン です。 これ は なん です か?

When I attempt to read it, nothing is displayed in the console, and in the debugger the read buffer contains only garbage. Here is the function I am using to read the file:

wchar_t* ReadTextFileW(wchar_t* filePath, size_t numBytesToRead, size_t maxBufferSize, const wchar_t* mode, int seekOffset, int seekOrigin)
{
    size_t numItems = 0;
    size_t bufferSize = 0;
    wchar_t* buffer = NULL;
    FILE* file = NULL;

    //Ensure the filePath does NOT lead to a device.
    if (IsPathADevice(filePath) == false)
    {
        //0 indicates to read as much as possible (the max specified).
        if (numBytesToRead == 0)
        {
            numBytesToRead = maxBufferSize;
        }

        if (filePath != NULL && mode != NULL)
        {
            //Ensure there are no errors in opening the file.
            if (_wfopen_s(&file, filePath, mode) == 0)
            {
                //Set the cursor location (back to the beginning of the file by default).
                if (fseek(file, seekOffset, seekOrigin) != 0)
                {
                    //Error: Could not change file cursor position.
                    fclose(file);
                    return NULL;
                }

                //Calculate the size of the buffer in bytes.
                bufferSize = numBytesToRead * sizeof(wchar_t);

                //Create the buffer to store file data in.
                buffer = (wchar_t*)_aligned_malloc(bufferSize, BYTE_ALIGNMENT);

                //Ensure the buffer was allocated.
                if (buffer == NULL)
                {
                    //Error: Buffer could not be allocated.
                    fclose(file);
                    return NULL;
                }

                //Clear any garbage data in the buffer.
                memset(buffer, 0, bufferSize);

                //Read the data from the file.
                numItems = fread_s(buffer, bufferSize, sizeof(wchar_t), numBytesToRead, file);

                //Check for read errors.
                if (numItems <= 0)
                {
                    //Error: File could not be read.
                    fclose(file);
                    _aligned_free(buffer);
                    return NULL;
                }

                //Ensure the file is closed without errors.
                if (fclose(file) != 0)
                {
                    //Error: File did not close properly.
                    _aligned_free(buffer);
                    return NULL;
                }
            }
        }
    }

    return buffer;
}


To call this function, I am doing the following. Perhaps I'm not using setlocale() correctly, but from what I have read it seems that I am. Just to reiterate, the problem is that garbage seems to be read in and nothing is displayed in the console:

setlocale(LC_ALL, "jp");
wchar_t* retVal = ReadTextFileW(L"C:\\jap.txt");
printf("%S\n", retVal);
_aligned_free(retVal);


I also have the following defined at the top of my .cpp file:

#define UNICODE
#define _UNICODE


SOLVED:

To fix this, as ryyker mentioned, you need to know the encoding that was used to create the original file. In Notepad and Notepad++ there is a drop-down menu for the encoding. UTF-8 is the default and the most commonly used.

Once you know the encoding, you can pass it to _wfopen_s() through the ccs flag in the mode string:

wchar_t* retVal = ShitFuck::ReadTextFileW(L"C:\\jap.txt", 0, 1024, L"r, ccs=UTF-8");
MessageBoxW(NULL, retVal, NULL, 0);
_aligned_free(retVal);
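
For readers wondering why the original raw read produced garbage: the UTF-8 bytes in the file have to be translated into wide characters, and copying them straight into a wchar_t buffer skips that step. The following is a minimal, portable sketch of that translation, not the CRT's actual implementation. The helper names utf8_decode_one and utf8_to_codepoints are hypothetical, and the decoder is simplified (it does not reject overlong forms or surrogate code points).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with at
   most len bytes available. Stores the code point in *out and returns the
   number of bytes consumed, or 0 if the sequence is malformed or truncated.
   Simplified: overlong forms and surrogates are not rejected. */
size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *out)
{
    size_t need, i;
    uint32_t cp;

    if (len == 0) return 0;

    if (s[0] < 0x80)                { cp = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 4; }
    else return 0; /* stray continuation byte or invalid lead byte */

    if (need > len) return 0; /* sequence cut off by end of buffer */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *out = cp;
    return need;
}

/* Hypothetical helper: decode a whole UTF-8 buffer into code points.
   Returns the number of code points written, or (size_t)-1 on malformed
   input or if dst is too small. */
size_t utf8_to_codepoints(const unsigned char *src, size_t srcLen,
                          uint32_t *dst, size_t dstCap)
{
    size_t in = 0, n = 0;

    while (in < srcLen) {
        uint32_t cp;
        size_t used = utf8_decode_one(src + in, srcLen - in, &cp);
        if (used == 0 || n == dstCap) return (size_t)-1;
        dst[n++] = cp;
        in += used;
    }
    return n;
}
```

For example, the nine bytes E3 82 8F E3 81 9F E3 81 97 (the UTF-8 form of わたし) decode to the three code points U+308F, U+305F and U+3057. Reading those same nine bytes directly into a wchar_t buffer would instead pair them up into unrelated 16-bit values.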


The console did not display the Japanese characters for me, so I used MessageBoxW() to show them instead.

Answer

The following is an excerpt on the encodings relevant to Japanese text. The file was created using Notepad++ (as the OP stated in the comments).

Double Byte encodings, also called, by usage, Double Byte Character Set (DBCS)

Some of them preexisted Unicode, and were designed to encode character sets with a large number of characters, mainly found in Far East languages with ideographic or syllabic scripts:

The 2 Bytes Universal Character Set : UCS-2 Big Endian and UCS-2 Little Endian
The Japanese Code Page : Shift-JIS ( Windows-932 )
The Chinese Code Pages : Simplified Chinese GB2312 ( Windows-936 ), Traditional Chinese Big5 ( Windows-950 )
The Korean Code Pages : Windows 949, EUC-KR

It appears that Shift-JIS might be the encoding you are trying to read. From its description:

Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft...
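
To illustrate what "double byte" means in practice for Shift-JIS, here is a rough sketch that counts characters in a Shift-JIS byte buffer by looking at lead bytes. It assumes the lead-byte ranges 0x81-0x9F and 0xE0-0xFC used by Windows code page 932, does not validate trail bytes, and the function names sjis_is_lead_byte and sjis_count_chars are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if b starts a two-byte character.
   Assumes the lead-byte ranges of Windows code page 932. */
int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: count characters in a Shift-JIS buffer.
   Returns (size_t)-1 if the buffer ends in the middle of a two-byte
   character. Trail bytes are not validated in this sketch. */
size_t sjis_count_chars(const unsigned char *s, size_t len)
{
    size_t i = 0, count = 0;

    while (i < len) {
        if (sjis_is_lead_byte(s[i])) {
            if (i + 1 >= len) return (size_t)-1; /* truncated pair */
            i += 2; /* lead byte plus trail byte */
        } else {
            i += 1; /* ASCII or single-byte half-width katakana */
        }
        count++;
    }
    return count;
}
```

Because each hiragana character occupies two bytes in this encoding, a six-byte buffer holding three hiragana counts as three characters. On Windows, IsDBCSLeadByteEx() performs the lead-byte test for a given code page.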

In general, you need to determine the encoding used to create the multi-byte characters in a file before they can be read back correctly by a function in C or in any other language. This link may help.
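
One cheap first check when trying to determine a file's encoding is to look for a byte order mark (BOM) at the start of the file. The sketch below is only a heuristic under stated assumptions: many UTF-8 files and all Shift-JIS files have no BOM, and the FF FE pattern is also the start of a UTF-32 little-endian BOM. The BomKind enum and detect_bom function are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical result type for the sketch. */
typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } BomKind;

/* Hypothetical helper: inspect the first bytes of a file (already read
   into buf) and report which BOM, if any, they form. BOM_NONE means
   "no BOM found", not "not Unicode". */
BomKind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

To use it, open the file in binary mode, read the first few bytes with fread(), and pass them in. If the result is BOM_NONE, the encoding still has to be established another way, for example from the editor that saved the file.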
