Svetsi Svetsi -4 years ago 191
C++ Question

Codepoint from Unicode Character?

This question has been asked before, but its solution depends on the Microsoft Foundation Classes (MFC), which I don't want to rely on. Basically, what I want to do is convert a Unicode character into its equivalent code point.

The code below was the solution using MFC. Is there a way of doing this without using afxwin.h?

#include <afxwin.h>

#include <iostream>

int main() {
  using namespace std;

  TCHAR myString[50] = _T("عربى");
  int stringLength = _tcslen(myString); // <----- edit here

  for (int i = 0; i < stringLength; i++)
  {
    unsigned int number = myString[i];
    cout << number << endl;
  }
}
Output:

1593
1585
1576
1609

Answer Source

Update

If your compiler supports it, the easiest way to do this is probably to write your constant string as U"عربى". This gives you an array of char32_t characters whose code points are just their value converted with static_cast<uint32_t>(). To print them in standard format, just prepend U+ and print the hex value.

Try this on a compiler that supports C++11 or later (I recommend saving the source file as UTF-8).

#include <cstdlib>
#include <iomanip>
#include <iostream>

using std::cout;

int main()
{
  constexpr char32_t codepoints[] = U"عربى";
  constexpr size_t n = sizeof(codepoints)/sizeof(char32_t);

  cout.setf( cout.hex, cout.basefield );     // Output in hex
  cout.setf( cout.right, cout.adjustfield ); // Prepending
  cout.fill('0');                            // leading zeroes
  // Fixed: Don’t print the terminating U'\0'.
  for ( size_t i = 0; i < n && codepoints[i]; ++i )
    cout << "U+" << std::setw(4) << (unsigned long)codepoints[i] << std::endl;

  return EXIT_SUCCESS;
}
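
The same idea also works for strings that are only known at run time. The following is a minimal sketch, assuming the text is already held as a std::u32string; the helper names format_codepoint and format_codepoints are made up for illustration and are not part of any library.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: format one code point in "U+XXXX" notation
// (at least four hex digits, upper case).
std::string format_codepoint(char32_t cp)
{
  std::ostringstream out;
  out << "U+" << std::uppercase << std::hex << std::setfill('0')
      << std::setw(4) << static_cast<std::uint32_t>(cp);
  return out.str();
}

// Hypothetical helper: format every code point of a UTF-32 string.
std::vector<std::string> format_codepoints(const std::u32string& text)
{
  std::vector<std::string> result;
  for (char32_t cp : text)
    result.push_back(format_codepoint(cp));
  return result;
}
```

Calling format_codepoints(U"عربى") should give one entry per character of the question's string, since each char32_t element already holds a whole code point.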

Conversions

The C++ standard library has had <codecvt> since C++11 (it is deprecated as of C++17), which can convert from UTF-8 or UTF-16 to UCS-4/UTF-32. Example code (from http://en.cppreference.com/w/cpp/locale/codecvt_utf16):

#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

void prepare_file()
{
  // UTF-16le data (if host system is little-endian)
  char16_t utf16le[4] ={0x007a, // latin small letter 'z' U+007a
                        0x6c34, // CJK ideograph "water"  U+6c34
                        0xd834, 0xdd0b}; // musical sign segno U+1d10b
  // store in a file
  std::ofstream fout("text.txt");
  fout.write( reinterpret_cast<char*>(utf16le), sizeof utf16le);
}

int main() 
{
  prepare_file();
  // open as a byte stream
  std::wifstream fin("text.txt", std::ios::binary);
  // apply facet
  fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));

  for (wchar_t c; fin.get(c); )
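    // Cast to an integer: streaming a wchar_t to a narrow stream is deleted in C++20.
    std::cout << std::showbase << std::hex << static_cast<unsigned long>(c) << '\n';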
}
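
The segno in the example above is stored as two UTF-16 code units (0xd834, 0xdd0b) that together encode U+1d10b. If you would rather not depend on a conversion facet, the pairing can be done by hand. This is only a sketch: the function name utf16_to_codepoints is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption of this example, not something the facet is documented to do.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-16 code units into code points.
// Unpaired surrogates become U+FFFD (a choice made for this sketch).
std::u32string utf16_to_codepoints(const std::u16string& in)
{
  std::u32string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    char32_t unit = in[i];
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
        char32_t low = in[++i];
        out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
      } else {
        out.push_back(0xFFFD);
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate without a preceding high surrogate.
      out.push_back(0xFFFD);
    } else {
      // Everything else in the BMP is its own code point.
      out.push_back(unit);
    }
  }
  return out;
}
```

For the three characters written by prepare_file(), this yields 0x7a, 0x6c34 and 0x1d10b.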

C11 and C++11 also have functions to convert between multibyte strings (UTF-8 when a UTF-8 locale is set) and UTF-16 or UTF-32 (from here: http://en.cppreference.com/w/c/string/multibyte/mbrtoc32). The mbstowcs() function might be relevant, too.

#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <uchar.h>
#include <assert.h>   

mbstate_t state;

int main(void)
{
  setlocale(LC_ALL, "en_US.utf8");
  char *str = u8"z\u00df\u6c34\U0001F34C"; // or u8"zß水
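
If depending on the process locale is inconvenient, UTF-8 can also be decoded by hand, since the lead byte tells you how many continuation bytes follow. The following is a minimal sketch rather than a validating decoder: the name utf8_to_codepoints is hypothetical, malformed sequences are replaced with U+FFFD by assumption, and overlong encodings and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into code points without using the locale.
// Malformed input yields U+FFFD; overlong forms and surrogates are NOT rejected.
std::u32string utf8_to_codepoints(const std::string& in)
{
  std::u32string out;
  std::size_t i = 0;
  while (i < in.size()) {
    unsigned char lead = static_cast<unsigned char>(in[i++]);
    char32_t cp = 0;
    int extra = 0; // number of continuation bytes expected
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else { out.push_back(0xFFFD); continue; } // stray continuation or invalid lead byte

    bool ok = true;
    for (int k = 0; k < extra; ++k) {
      // Each continuation byte must look like 10xxxxxx.
      if (i >= in.size() || (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
        ok = false;
        break;
      }
      cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
    }
    out.push_back(ok ? cp : char32_t(0xFFFD));
  }
  return out;
}
```

Given the UTF-8 bytes of "z\u00df\u6c34\U0001F34C", the result holds the four code points 0x7a, 0xdf, 0x6c34 and 0x1f34c.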