W. Goeman - 2 months ago
C++ Question

C++ substring multi byte characters

I have a std::string that contains some characters that span multiple bytes.

When I take a substring of this string, the output is not valid because, of course, each of these characters is counted as 2 characters. In my opinion I should be using a wstring instead, because it would store each of these characters as one element instead of several.

So I decided to copy the string into a wstring, but of course this does not make sense, because the characters remain split over 2 elements. This only makes it worse.

Is there a good way to convert a string to a wstring, merging each special character into 1 element instead of 2?

Thanks

Answer

There are really only two possible solutions. If you're doing this a lot, over large distances in the string, you'd be better off converting your characters to a single-element encoding, using wchar_t (or int32_t, or whatever is most appropriate). This is not a simple copy, which would convert each individual char into the target type, but a true conversion function, which recognizes the multibyte characters and converts each of them into a single element.
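As an illustration of what such a "true conversion" might look like for UTF-8 input, here is a minimal sketch. The function name decodeUtf8 is hypothetical and not part of the answer. It uses char32_t rather than wchar_t, on the assumption that one element per code point is wanted on every platform (wchar_t is only 16 bits on Windows). It replaces truncated or malformed sequences with U+FFFD, but does not reject overlong encodings or surrogate values, so treat it as a starting point rather than a validating decoder.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the answer): decode UTF-8 into one
// char32_t per code point. Truncated or malformed sequences become U+FFFD.
inline std::u32string decodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;            // total bytes in this sequence
        char32_t cp = lead;             // code point being assembled
        if (lead < 0x80)                { /* ASCII: a single byte */ }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else                            { cp = 0xFFFD; }  // stray continuation or invalid byte

        // Fold in the continuation bytes (10xxxxxx), six bits at a time.
        bool ok = (i + len <= in.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; }
            else                    { cp = (cp << 6) | (c & 0x3F); }
        }
        if (!ok) { cp = 0xFFFD; len = 1; }  // resynchronise on the next byte

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Once decoded, the usual std::u32string::substr counts in code points, so decodeUtf8(text).substr(2, 1) picks out the third character regardless of how many bytes it occupied in the original string.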

For occasional use or shorter sequences, it's possible to write your own functions for advancing by one character at a time, that is, by the number of bytes the character's first byte announces. For UTF-8, I use the following:

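#include <stddef.h>     // size_t
#include <iterator>     // std::iterator_traits and the iterator tags

// Number of bytes in the UTF-8 character whose first byte is ch.
// Byte and byteCountTable are defined elsewhere and are not shown here.
inline size_t
size(
    Byte                ch )
{
    return byteCountTable[ ch ] ;
}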

template< typename InputIterator >
InputIterator
succ(
    InputIterator       begin,
    size_t              size,
    std::random_access_iterator_tag )
{
    return begin + size ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator       begin,
    size_t              size,
    std::input_iterator_tag )
{
    while ( size != 0 ) {
        ++ begin ;
        -- size ;
    }
    return begin ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator       begin,
    InputIterator       end )
{
    if ( begin != end ) {
        begin = succ( begin, size( *begin ),
                      typename std::iterator_traits< InputIterator >::iterator_category() ) ;
    }
    return begin ;
}

template< typename InputIterator >
size_t
characterCount(
    InputIterator       begin,
    InputIterator       end )
{
    size_t              result = 0 ;
    while ( begin != end ) {
        ++ result ;
        begin = succ( begin, end ) ;
    }
    return result ;
}
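
The code above depends on Byte and byteCountTable, which the answer does not show. As a rough sketch of how the same idea could be applied to the original substring problem, the following uses a small function in place of the table and builds a character-based substring on top of it. All three names (utf8Length, utf8Advance, utf8Substr) are hypothetical, and the choice to count invalid or continuation bytes as one byte is an assumption made so that a scan always moves forward.

```cpp
#include <cstddef>
#include <string>

// Assumed stand-in for the byteCountTable lookup: length in bytes of the
// UTF-8 sequence that starts with `lead`. Continuation and invalid bytes
// count as 1 so that a scan always makes progress.
inline std::size_t utf8Length(unsigned char lead)
{
    if (lead < 0x80)           return 1;   // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 1;
}

// Byte offset reached after skipping `n` characters starting at byte
// offset `pos`; clamped to s.size() if the string ends first.
inline std::size_t utf8Advance(const std::string& s, std::size_t pos, std::size_t n)
{
    while (n != 0 && pos < s.size()) {
        const std::size_t len = utf8Length(static_cast<unsigned char>(s[pos]));
        pos = (len > s.size() - pos) ? s.size() : pos + len;
        --n;
    }
    return pos;
}

// Substring measured in characters (code points) instead of bytes.
inline std::string utf8Substr(const std::string& s, std::size_t first, std::size_t count)
{
    const std::size_t from = utf8Advance(s, 0, first);
    const std::size_t to   = utf8Advance(s, from, count);
    return s.substr(from, to - from);
}
```

With this, utf8Substr(text, 2, 1) returns the complete third character even when it occupies two or three bytes. Note that "character" here means a code point; a letter followed by a combining accent would still count as two.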