JasonSmith - 17 days ago 5
JSON Question

Truncating unicode so it fits a maximum size when encoded for wire transfer

Given a Unicode string and these requirements:


  • The string must be encoded into some byte-sequence format (e.g. UTF-8 or JSON Unicode escapes)

  • The encoded string has a maximum length



For example, the iPhone push service requires JSON encoding with a maximum total packet size of 256 bytes.

What is the best way to truncate the string so that it re-encodes to valid Unicode and displays reasonably correctly?

(Human language comprehension is not necessary: the truncated version can look odd, e.g. because of an orphaned combining character or a Thai vowel, as long as the software doesn't crash when handling the data.)
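To make the problem concrete, here is a minimal sketch (assuming Python 3 and UTF-8; the variable names are only illustrative) of what goes wrong when the encoded bytes are simply sliced at the limit:

```python
text = 'абвгд'                    # 5 characters, 2 bytes each in UTF-8
cut = text.encode('utf-8')[:5]    # the slice ends in the middle of the third character

try:
    cut.decode('utf-8')           # strict decoding rejects the incomplete sequence
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False
```

The slice itself is fine as bytes; it is the receiver's strict decode that fails, which is the crash the truncation has to avoid.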



Answer
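def unicode_truncate(s, length, encoding='utf-8'):
    # Encode, then keep at most `length` bytes; the cut may split a multi-byte character.
    encoded = s.encode(encoding)[:length]
    # 'ignore' drops the incomplete trailing sequence instead of raising UnicodeDecodeError.
    return encoded.decode(encoding, 'ignore')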

Here is an example for a Unicode string in which each character takes 2 bytes in UTF-8. A 5-byte limit covers two whole characters plus the first byte of the third, which is dropped:

>>> unicode_truncate(u'абвгд', 5)
u'\u0430\u0431'
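The function above limits the raw encoded bytes. For the JSON escape case mentioned in the question, the size that matters is that of the escaped string literal, where a non-ASCII character can cost 6 or 12 bytes. The following is only a sketch under stated assumptions: it targets Python 3, relies on `json.dumps` escaping non-ASCII by default, and `json_truncate` is a hypothetical helper name, not part of any library.

```python
import json

def json_truncate(s, max_bytes):
    """Return the longest prefix of s whose JSON string literal fits in max_bytes.

    The budget includes the two surrounding quotes, so max_bytes should be
    at least 2. With the default ensure_ascii=True the output of json.dumps
    is pure ASCII, so its length in characters equals its length in bytes.
    """
    total = 2  # opening and closing quote
    end = 0
    for i, ch in enumerate(s):
        # Escaped size of this one character, without the quotes.
        cost = len(json.dumps(ch)) - 2
        if total + cost > max_bytes:
            break
        total += cost
        end = i + 1
    return s[:end]

print(json_truncate('абвгд', 20))  # three characters at 6 bytes each, plus quotes
```

Because Python 3 iterates strings by code point, a character outside the Basic Multilingual Plane is kept or dropped as a whole, so its surrogate-pair escape is never split. Note that the limit in the question applies to the total packet, so the budget passed here would be whatever remains after the rest of the payload is accounted for.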