bomjacob - 11 months ago
Python Question

UTF-16 codepoint counting in python

I'm getting some data from the Telegram Bot API through the python-telegram-bot library.
The data is returned as UTF-8 encoded JSON.
Example (snippet):

{'message': {'text': '

Answer

Python has already correctly decoded the UTF-8 encoded JSON data to Python (Unicode) strings, so there is no need to handle UTF-8 here.

To count UTF-16 code units, you'd have to encode to UTF-16, take the length of the encoded data in bytes, and divide by two (each UTF-16 code unit is two bytes). I'd encode to either utf-16-le or utf-16-be to prevent a BOM from being added:

>>> len(text.encode('utf-16-le')) // 2
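The difference from a plain len() only shows up for characters outside the Basic Multilingual Plane, such as most emoji, which take two UTF-16 code units each. A minimal sketch, where the helper name utf16_length and the sample string are made up for illustration:

```python
def utf16_length(text: str) -> int:
    """Return the number of UTF-16 code units needed to represent text."""
    # Encoding without a BOM yields exactly two bytes per code unit.
    return len(text.encode('utf-16-le')) // 2


# Hypothetical sample: 'a', one emoji outside the BMP, 'b'.
sample = 'a\U0001F600b'
print(len(sample))           # counts code points: 3
print(utf16_length(sample))  # counts UTF-16 code units: 4
```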

To use the entity offsets, you can encode to UTF-16, slice on doubled offsets, then decode again:

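# Encode once; every UTF-16 code unit is exactly two bytes.
text_utf16 = text.encode('utf-16-le')
for entity in entities:
    # Offsets and lengths are given in UTF-16 code units.
    start = entity['offset']
    end = start + entity['length']
    # Double the offsets to get byte positions, then decode back to str.
    entity_text = text_utf16[start * 2:end * 2].decode('utf-16-le')
    print('Url: ', entity_text)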
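
For reference, the same approach as a self-contained sketch. The function name extract_entity_texts, the sample text, and the sample entity are all made up here (they are not taken from the question's data); the entity only assumes the 'offset' and 'length' keys used above:

```python
def extract_entity_texts(text, entities):
    """Return the substring for each entity, using UTF-16 code unit offsets."""
    text_utf16 = text.encode('utf-16-le')
    results = []
    for entity in entities:
        start = entity['offset']
        end = start + entity['length']
        # Each code unit is two bytes, so byte positions are doubled offsets.
        results.append(text_utf16[start * 2:end * 2].decode('utf-16-le'))
    return results


# Hypothetical message: the leading emoji takes two UTF-16 code units,
# so the URL starts at offset 7 in UTF-16 terms but at index 6 in the str.
sample_text = '\U0001F600 see http://example.com now'
sample_entities = [{'type': 'url', 'offset': 7, 'length': 18}]

print(extract_entity_texts(sample_text, sample_entities))
# Slicing the str directly with the same offsets is off by one here:
print(repr(sample_text[7:25]))
```

With this sample the helper returns the full URL, while the direct slice drops the leading 'h' and picks up a trailing space, because the emoji counts as one code point but two UTF-16 code units.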