Hai Hai - 1 year ago 117
C++ Question

C++ iterate utf-8 string with mixed length of characters

I need to loop over a utf-8 string and get each character of the string. There might be different types of characters in the string, e.g. numbers with the length of one byte, Chinese characters with the length of three bytes, etc.

I looked at this post and it can do 80% of the job, except that when the string has 3-byte Chinese characters before 1-byte numbers, it treats the numbers as 3 bytes long as well and prints them as 1**, where * is gibberish.

To give an example, if the string is '今天周五123', the result will be:

今
天
周
五
1**
2**
3**

where * is gibberish. However, if the string is '123今天周五', the numbers will print out fine.

The minimally adapted code from the above mentioned post is copied here:

#include <cstring>
#include <iostream>
#include <string>
#include "utf8.h"

using namespace std;

int main() {
    string text = "今天周五123";

    char* str = (char*)text.c_str();    // utf-8 string
    char* str_i = str;                  // string iterator
    char* end = str+strlen(str)+1;      // end iterator (one past the terminating '\0')

    unsigned char symbol[5] = {0,0,0,0,0};

    cout << symbol << endl;

    do
    {
        uint32_t code = utf8::next(str_i, end); // get 32 bit code of a utf-8 symbol
        if (code == 0)
            continue;

        cout << "utf 32 code:" << code << endl;

        utf8::append(code, symbol); // initialize array `symbol`

        cout << symbol << endl;

    }
    while ( str_i < end );

    return 0;
}

Can anyone help me here? I am new to C++, and although I checked the documentation of UTF8-CPP, I still have no idea where the problem is. I think the library was created to handle exactly this case, where UTF-8 characters have different lengths, so there should be a way to do this... I have been struggling with this for two days...

Answer Source


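utf8::append only writes the bytes of the current code point and does not add a terminating '\0', so after a 3-byte character the buffer still holds that character's last two bytes when a 1-byte character is written over its first byte. Clear the buffer with

memset(symbol, 0, sizeof(symbol));

immediately before

utf8::append(code, symbol);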
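The stale bytes can be reproduced without the library. The following is only a sketch: `append_utf8` is a hypothetical stand-in written for this illustration (it is not the utfcpp implementation and does no validation), and `encode_twice` is a made-up helper that writes two code points into the same 5-byte buffer, optionally clearing it in between.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for an "append" style encoder: writes the UTF-8
// bytes of `cp` starting at `out` and returns a pointer just past the last
// byte written. It does NOT write a terminating '\0'.
// Assumption: `cp` is a valid code point; nothing is validated here.
unsigned char* append_utf8(std::uint32_t cp, unsigned char* out) {
    if (cp < 0x80) {
        *out++ = static_cast<unsigned char>(cp);
    } else if (cp < 0x800) {
        *out++ = static_cast<unsigned char>(0xC0 | (cp >> 6));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *out++ = static_cast<unsigned char>(0xE0 | (cp >> 12));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    } else {
        *out++ = static_cast<unsigned char>(0xF0 | (cp >> 18));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes `first` and then `second` into the same buffer and returns the
// buffer contents, read as a C string, after the second write.
std::string encode_twice(std::uint32_t first, std::uint32_t second,
                         bool clear_between) {
    unsigned char symbol[5] = {0, 0, 0, 0, 0};
    append_utf8(first, symbol);
    if (clear_between) {
        std::memset(symbol, 0, sizeof(symbol));  // drop bytes of `first`
    }
    append_utf8(second, symbol);
    return std::string(reinterpret_cast<char*>(symbol));
}
```

With `encode_twice(0x4E94, '1', false)` (五 followed by 1) the result is three bytes long: `1` plus the two leftover bytes of 五. With `clear_between` set to `true` the result is just `"1"`. An alternative to clearing the whole buffer would be to write `'\0'` at the pointer the encoder returns.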

If this for some reason still doesn't work, or if you want to get rid of the lib, recognizing codepoints is not that complicated:

string text = "今天周五123";
for(size_t i = 0; i < text.length();)
{
    // the lead byte tells how many bytes the code point occupies
    int cplen = 1;
    if((text[i] & 0xf8) == 0xf0) cplen = 4;
    else if((text[i] & 0xf0) == 0xe0) cplen = 3;
    else if((text[i] & 0xe0) == 0xc0) cplen = 2;
    // truncated sequence at the end of the string: fall back to one byte
    if((i + cplen) > text.length()) cplen = 1;

    cout << text.substr(i, cplen) << endl;
    i += cplen;
}

With both solutions, however, be aware that glyphs made up of several code points exist, as well as code points that can't be printed on their own (combining marks, for example).
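To make that caveat concrete, here is a minimal sketch that wraps the lead-byte tests from the loop above in two helper functions. The names `utf8_seq_len` and `split_code_points` are invented for this example, and continuation bytes are not validated.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence introduced by `lead`.
// Continuation bytes and invalid lead bytes fall back to 1 so that the
// caller always makes progress.
std::size_t utf8_seq_len(unsigned char lead) {
    if ((lead & 0xF8) == 0xF0) return 4;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xE0) == 0xC0) return 2;
    return 1;
}

// Splits `text` into one substring per code point (not per visible glyph).
std::vector<std::string> split_code_points(const std::string& text) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < text.size();) {
        std::size_t len = utf8_seq_len(static_cast<unsigned char>(text[i]));
        if (i + len > text.size()) len = 1;  // truncated sequence at the end
        out.push_back(text.substr(i, len));
        i += len;
    }
    return out;
}
```

For "今天周五123" this yields seven pieces, one per character. For "e" followed by U+0301 (combining acute accent, bytes 0xCC 0x81), which is usually displayed as the single glyph é, it yields two pieces, and the second one is a combining mark that does not print sensibly alone. Grouping code points into user-visible characters needs grapheme cluster segmentation, which this sketch does not attempt.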