Minh Le - 1 year ago
Perl Question

Perl substr based on bytes

I'm using SimpleDB for my application. Everything works well except that a single attribute is limited to 1024 bytes, so I have to chop a long string into chunks before saving it.

My problem is that my strings sometimes contain Unicode characters (Chinese, Japanese, Greek), and the substr() function counts characters, not bytes.

I tried

use bytes;
for byte semantics, and later
substr(encode_utf8($str), $start, $length)
but neither helped.

Any help would be appreciated.

Answer

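To split the string into chunks of at most 1024 bytes, each of which is valid UTF-8, use

use Encode qw(encode_utf8 decode_utf8);

my $utf8 = encode_utf8($text);
# Take up to 1024 bytes, backing off while the next byte is a UTF-8 continuation byte.
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;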
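
To see why the lookahead matters, here is a minimal sketch that compares a plain byte-based substr() with the regex above. The helper names utf8_chunks and is_valid_utf8 are made up for this illustration, and the limit is lowered to 7 bytes so the effect shows up on a short string. It assumes the limit is at least as large as the longest encoded character (4 bytes); with a smaller limit the match would simply stop at the first character that does not fit.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode);

# Hypothetical helper: split $text into UTF-8 byte chunks of at most
# $max bytes each, never cutting inside a multi-byte character.
sub utf8_chunks {
    my ($text, $max) = @_;
    my $utf8 = encode_utf8($text);
    # Greedily take up to $max bytes, then back off while the byte
    # that follows is a continuation byte (0x80-0xBF).
    return $utf8 =~ /\G(.{1,$max})(?![\x80-\xBF])/sg;
}

# Hypothetical helper: true if $bytes decodes as well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { decode('UTF-8', $bytes, $check); 1 } ? 1 : 0;
}

my $text   = "\x{4E2D}" x 5;                     # five 3-byte characters
my $naive  = substr(encode_utf8($text), 0, 7);   # cuts the third character
my @chunks = utf8_chunks($text, 7);

printf "naive chunk valid: %s\n", is_valid_utf8($naive) ? 'yes' : 'no';
printf "chunk sizes: %s\n", join ',', map { length } @chunks;
```

The naive substr() does cut by bytes, but it can stop in the middle of a character, which is why the result cannot be decoded again. The regex gives up a few bytes per chunk so that every chunk ends on a character boundary.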

Then pass the chunks to your storage routine in whichever form it expects:

# Store expects bytes.
store($_) for @utf8_chunks;


# Store expects decoded text.
store(decode_utf8($_)) for @utf8_chunks;
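
For completeness, a rough round-trip sketch: the chunks are saved under numbered keys and later joined and decoded back into the original text. Everything storage-related here is an assumption for illustration only: %fake_db is an in-memory stand-in for the real attribute store, and save_chunked, load_chunked and the "name.N" key scheme are hypothetical, not part of any SimpleDB API.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my %fake_db;   # in-memory stand-in for the real attribute store (assumption)

# Hypothetical: store $text under "$name.0", "$name.1", ... as byte
# chunks of at most $max bytes. Returns the number of chunks written.
sub save_chunked {
    my ($name, $text, $max) = @_;
    my $utf8   = encode_utf8($text);
    my @chunks = $utf8 =~ /\G(.{1,$max})(?![\x80-\xBF])/sg;
    $fake_db{"$name.$_"} = $chunks[$_] for 0 .. $#chunks;
    return scalar @chunks;
}

# Hypothetical: read $count chunks back, join the bytes, decode once.
sub load_chunked {
    my ($name, $count) = @_;
    my $utf8 = join '', map { $fake_db{"$name.$_"} } 0 .. $count - 1;
    return decode_utf8($utf8);
}

# 200 repetitions of a 15-byte unit (Japanese, Greek, spaces): 3000 bytes.
my $text  = "\x{65E5}\x{672C}\x{8A9E} \x{03B1}\x{03B2} " x 200;
my $count = save_chunked('body', $text, 1024);
my $back  = load_chunked('body', $count);

printf "chunks: %d, round trip ok: %s\n", $count, $back eq $text ? 'yes' : 'no';
```

Joining the raw bytes first and decoding once at the end works whether or not the individual chunks were decoded before storing, as long as the store returns the same bytes it was given.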