HoboBen HoboBen - 2 months ago 14
C Question

Iterate backwards through a utf8 multibyte string

I use a slightly modified version of this function, is_utf8 (http://stackoverflow.com/a/1031773/275677), to extract UTF-8 sequences from a character array. It returns the sequence and the number of bytes in it, so that I can iterate over a string in this way.

However, I would now like to iterate backwards over a string (char *). What is the best way to do this?




My guess is to try to classify the last four, three, two and one bytes of the string as utf8 (four times) and pick the longest.

However, is it ever the case that UTF-8 is ambiguous? For example, can aaaabb, parsed as aaaa.bb, also be parsed (backwards) as aa.aabb, where aa, aaaa, bb and aabb are all valid UTF-8 sequences?

Answer

A string consists of a series of UTF-8 sequences. All UTF-8 sequences:

  • EITHER consist of exactly one octet (byte to you and me) with the top bit clear

  • OR consist of one octet with the two topmost bits set, followed by one or more octets with bit 7 set and bit 6 clear.

See http://en.wikipedia.org/wiki/Utf8#Description for details.
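
The two rules above translate directly into bit masks. The following is a minimal sketch, not part of the original answer: the names is_single, is_lead, is_continuation and count_sequences are made up for illustration, and count_sequences assumes a well-formed, NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>

/* Single-octet sequence: top bit clear (0xxxxxxx). */
static int is_single(unsigned char c)
{
    return (c & 0x80) == 0;
}

/* First octet of a multi-octet sequence: two topmost bits set (11xxxxxx). */
static int is_lead(unsigned char c)
{
    return (c & 0xC0) == 0xC0;
}

/* Following octet: bit 7 set and bit 6 clear (10xxxxxx). */
static int is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Count the sequences in s by counting the octets that start one.
   Assumes s is well-formed UTF-8. */
static size_t count_sequences(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (!is_continuation((unsigned char) *s))
            n++;
    return n;
}
```

Every octet falls into exactly one of the three classes, so an octet on its own tells you whether it starts a sequence. That is why a backwards scan never has to guess.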

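Because an octet with bit 7 set and bit 6 clear can only ever appear in the middle or at the end of a sequence, never at the start of one, there is no ambiguity about where a sequence begins. So what you need to do is step back one octet and check whether it has bit 7 set and bit 6 clear; if so, step back again, taking care not to go beyond the start of the string (note that if the string is well formed, this won't happen).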

Untested C-ish pseudocode:

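#include <stddef.h> /* NULL */

/* Return a pointer to the first octet of the sequence that precedes ptr,
   or NULL if ptr is already at the start of the string. */
const char *
findPrevious (const char *ptr, const char *start)
{
    do
    {
        if (ptr <= start)
            return NULL; /* we're already at the start of the string */
        ptr--;
    } while (((unsigned char) *ptr & 0xC0) == 0x80); /* skip continuation octets (10xxxxxx) */
    return ptr;
}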
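
To show how such a function might drive a backwards loop, here is a hedged sketch that copies a string with its sequences in reverse order. The names utf8_prev and utf8_reverse are invented for this example. It assumes a well-formed, NUL-terminated input and a destination buffer of at least strlen(src) + 1 bytes that does not overlap the source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Step back from ptr to the first octet of the preceding sequence,
   or return NULL if ptr is already at (or before) start. */
static const char *utf8_prev(const char *ptr, const char *start)
{
    assert(ptr != NULL && start != NULL);
    do {
        if (ptr <= start)
            return NULL;
        ptr--;
    } while (((unsigned char) *ptr & 0xC0) == 0x80);
    return ptr;
}

/* Copy src into dst with its sequences in reverse order.
   Assumption: src is well-formed UTF-8; dst is large enough. */
static void utf8_reverse(const char *src, char *dst)
{
    const char *end = src + strlen(src); /* one past the current sequence */
    const char *p;

    while ((p = utf8_prev(end, src)) != NULL) {
        size_t len = (size_t) (end - p); /* octets in this sequence */
        memcpy(dst, p, len);
        dst += len;
        end = p;
    }
    *dst = '\0';
}
```

Each call to utf8_prev yields the start of one sequence, and the distance to the previous stopping point gives its length, so there is no need to try the last four, three, two and one bytes in turn.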