Minh Le - 1 month ago
Perl Question

Perl substr based on bytes

I'm using SimpleDB for my application. Everything works well except that a single attribute is limited to 1024 bytes, so I have to chop a long string into chunks before saving it.

My problem is that my string sometimes contains Unicode characters (Chinese, Japanese, Greek), and the substr() function counts characters, not bytes.

I tried to use

use bytes;
for byte semantics, and later
substr(encode_utf8($str), $start, $length)
but it does not help at all.

Any help would be appreciated.

Answer

To split the string into chunks of at most 1024 bytes that are each valid UTF-8, use

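use Encode qw( encode_utf8 decode_utf8 );
my $utf8 = encode_utf8($text);
# Take up to 1024 bytes, backing off until the next byte is not a UTF-8 continuation byte.
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;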

Then either

# Store expects bytes.
store($_) for @utf8_chunks;

or

# Store expects decoded text.
store(decode_utf8($_)) for @utf8_chunks;
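
To see why the chunks stay valid, here is a minimal sketch that wraps the same regex in a helper and checks the result on a short mixed-script string. The helper name utf8_chunks, the $max parameter, and the tiny limit of 4 bytes are assumptions made for illustration only; they are not part of the answer above. The limit should be at least 4, the longest UTF-8 sequence, otherwise the match stops at the first character that cannot fit and the rest of the string is silently dropped.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Hypothetical helper: split decoded text into UTF-8 byte chunks of at most
# $max bytes without cutting a character in half.
sub utf8_chunks {
    my ($text, $max) = @_;
    my $utf8 = encode_utf8($text);
    # Greedily take up to $max bytes, then backtrack until the byte that
    # follows is not a continuation byte (0x80-0xBF), i.e. a character boundary.
    return $utf8 =~ /\G(.{1,$max})(?![\x80-\xBF])/sg;
}

# "a", e-acute, and three CJK characters: 1 + 2 + 3 + 3 + 3 = 12 bytes in UTF-8.
my $sample = "a\x{e9}\x{65e5}\x{672c}\x{8a9e}";
my @chunks = utf8_chunks($sample, 4);

for my $chunk (@chunks) {
    die "chunk exceeds the byte limit" if length($chunk) > 4;
    # FB_CROAK dies on malformed UTF-8; LEAVE_SRC keeps $chunk unmodified.
    decode_utf8($chunk, Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# Joining the byte chunks and decoding once gives back the original text.
die "round trip failed" unless decode_utf8(join '', @chunks) eq $sample;
print scalar(@chunks), " chunks\n";
```

When reading the data back, concatenate the stored byte chunks before decoding, as in the last check. Note that length() here counts bytes because the chunks are encoded strings, which is what the 1024-byte limit is measured in.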