moby moby - 1 month ago 13
Python Question

Truncating string to byte length in Python

I have a function here to truncate a given string to a given byte length:

LENGTH_BY_PREFIX = [
    (0xC0, 2), # first byte mask, total codepoint length
    (0xE0, 3),
    (0xF0, 4),
    (0xF8, 5),
    (0xFC, 6),
]

def codepoint_length(first_byte):
    if first_byte < 128:
        return 1 # ASCII
    for mask, length in LENGTH_BY_PREFIX:
        if first_byte & mask == mask:
            return length
    assert False, 'Invalid byte %r' % first_byte

def cut_string_to_bytes_length(unicode_text, byte_limit):
    utf8_bytes = unicode_text.encode('UTF-8')
    cut_index = 0
    while cut_index < len(utf8_bytes):
        step = codepoint_length(ord(utf8_bytes[cut_index]))
        if cut_index + step > byte_limit:
            # can't go a whole codepoint further, time to cut
            return utf8_bytes[:cut_index]
        else:
            cut_index += step
    # length limit is longer than our byte string, so no cutting
    return utf8_bytes


This seemed to work fine until the question of Emoji was introduced:

string = u"\ud83d\ude14"
trunc = cut_string_to_bytes_length(string, 100)

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<console>", line 5, in cut_string_to_bytes_length
  File "<console>", line 7, in codepoint_length
AssertionError: Invalid byte 152


Can anyone explain exactly what is going on here, and what a possible solution is?

Edit: I have another code snippet here that doesn't throw an exception, but has weird behavior sometimes:

import encodings
_incr_encoder = encodings.search_function('utf8').incrementalencoder()

def utf8_byte_truncate(text, max_bytes):
    """ truncate utf-8 text string to no more than max_bytes long """
    byte_len = 0
    _incr_encoder.reset()
    for index, ch in enumerate(text):
        byte_len += len(_incr_encoder.encode(ch))
        if byte_len > max_bytes:
            break
    else:
        return text
    return text[:index]

>>> string = u"\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14\ud83d\ude14"
>>> print string
(prints a set of 5 Apple Emoji...)

Answer

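If a byte f satisfies f & 0xF0 == 0xF0, then f & 0xC0 == 0xC0 also holds, because 0xF0 has all the bits that 0xC0 has, and then some. Your list is checked from the shortest prefix first, so, among other problems, codepoint_length() returns a step of 2 for a lead byte such as 0xF0 when it should return 4. The loop then lands on the third byte of the emoji, 152 (0x98), which is a continuation byte that matches no mask, and the assertion fires. If you reverse your LENGTH_BY_PREFIX list so that the longest prefix is tested first, the function works with the first example.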

LENGTH_BY_PREFIX = [
  (0xFC, 6),
  (0xF8, 5),
  (0xF0, 4),
  (0xE0, 3), 
  (0xC0, 2), # first byte mask, total codepoint length
]
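
As a rough sketch of how the reversed table behaves, the question's two functions can be rerun with it. A few details here are my own assumptions rather than part of the original code: the bytes are wrapped in a `bytearray` so that indexing yields integers on both Python 2 and Python 3 (which removes the `ord()` call), the `assert False` is replaced by a `ValueError`, and the emoji is written as `u"\U0001F614"`, which is the same character as the surrogate pair `u"\ud83d\ude14"` in the question.

```python
# Sketch only: the answer's reversed table plugged into the question's functions.
LENGTH_BY_PREFIX = [
    (0xFC, 6),  # first byte mask, total codepoint length
    (0xF8, 5),
    (0xF0, 4),
    (0xE0, 3),
    (0xC0, 2),
]

def codepoint_length(first_byte):
    if first_byte < 128:
        return 1  # ASCII
    # Longest prefix first, so 0xF0 is matched as a 4-byte lead before 0xC0 can claim it.
    for mask, length in LENGTH_BY_PREFIX:
        if first_byte & mask == mask:
            return length
    # Assumption: raise instead of assert, so the check survives "python -O".
    raise ValueError('Invalid byte %r' % first_byte)

def cut_string_to_bytes_length(unicode_text, byte_limit):
    # Assumption: bytearray gives integers when indexed on Python 2 and 3 alike.
    utf8_bytes = bytearray(unicode_text.encode('UTF-8'))
    cut_index = 0
    while cut_index < len(utf8_bytes):
        step = codepoint_length(utf8_bytes[cut_index])
        if cut_index + step > byte_limit:
            # can't go a whole codepoint further, time to cut
            return bytes(utf8_bytes[:cut_index])
        cut_index += step
    # length limit is longer than our byte string, so no cutting
    return bytes(utf8_bytes)

emoji = u"\U0001F614"  # encodes to the four bytes F0 9F 98 94

# The overlap described above: a 4-byte lead byte also passes the 2-byte mask.
overlap = (0xF0 & 0xC0 == 0xC0)                                   # True

whole = cut_string_to_bytes_length(emoji, 100)                    # all 4 bytes kept
one_of_two = cut_string_to_bytes_length(emoji * 2, 7)             # only the first emoji fits
ascii_only = cut_string_to_bytes_length(u"ab" + emoji, 5)         # the emoji does not fit
```

With the longest mask tested first, the lead byte 0xF0 yields a step of 4, so the loop never stops on a continuation byte for well-formed UTF-8 input.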