yolenoyer - 2 months ago
Python Question

Convert a numeric utf-8 sequence to a string

I need to convert strings of this kind (where unicode chars are stored in a special way):

Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre


... to a valid utf-8 string, like this:

Ce correspondant a cherché à vous joindre


I wrote the code to extract the numerical utf-8 sequence from this simple syntax (=XX=XX, with each X being a hex digit), but I'm stuck when I try to convert this sequence to a printable char: it's a utf-8 sequence, not a Unicode code point, so the chr() built-in is not useful here (or at least, not alone).

Briefly:



I need to transform this example value:

utf8_sequence = 0xC3A9


to this string:

return_value = 'é'


The Unicode code point for this letter is U+00E9, but I don't know how to get from the utf-8 sequence to this Unicode code point, which could then be used with chr().
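
To illustrate the gap between the two representations: the integer 0xC3A9 holds the two utf-8 bytes C3 and A9, and decoding those bytes yields the character whose code point is U+00E9. The following is only a minimal sketch, assuming the sequence is stored as an integer as in the example above; the helper name utf8_int_to_str is hypothetical.

```python
# Hypothetical helper: turns an integer such as 0xC3A9 into the
# character that its bytes encode in utf-8.
def utf8_int_to_str(utf8_sequence):
    # Number of bytes needed to hold the integer (at least one).
    num_bytes = max(1, (utf8_sequence.bit_length() + 7) // 8)
    # Big-endian: the first byte of the sequence is the most significant.
    raw = utf8_sequence.to_bytes(num_bytes, byteorder='big')
    return raw.decode('utf-8')


char = utf8_int_to_str(0xC3A9)
print(char)            # é
print(hex(ord(char)))  # 0xe9
```

Note that chr(0xC3A9) would instead return the unrelated character at code point U+C3A9, which is why chr() alone does not help here.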

My code



Here is my code, with a comment showing the place where I'm stuck:

#!/usr/bin/python3
# coding: utf-8

import re

test_string = 'Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre'


# SHOULD convert a string like '=C3=A9' to the equivalent Unicode
# char, in this example 'é'.
def vmg_to_unicode(in_string):

    whole_sequence = 0                  # Stores the numerical utf-8 sequence
    in_length = len(in_string)
    num_bytes = in_length // 3          # Number of bytes
    bit_weight = num_bytes << 3         # Weight of byte in bits (big-endian)

    for i in range(0, in_length, 3):    # For each '=XX' group:
        bit_weight -= 8
        # Extract the hex number inside '=XX':
        hex_number = in_string[i+1:i+3]
        # Build the utf-8 sequence:
        whole_sequence += int(hex_number, 16) << bit_weight

    # At this point, whole_sequence contains for example 0xC3A9

    # The following doesn't work, chr() expects a Unicode code point:
    # return chr(whole_sequence)

    # HOW CAN I RETURN A STRING LIKE 'é' THERE?

    # Only for debug:
    return '[0x{:X}]'.format(whole_sequence)


# In a whole string, convert all occurrences of patterns like '=C3=A9'
# to their equivalent Unicode chars.
def vmg_transform(in_string):

    # Get all occurrences:
    results = re.finditer('(=[0-9A-Fa-f]{2})+', in_string)

    index, out = (0, '')

    for result in results:
        # Concat the unchanged text:
        out += in_string[index:result.start()]
        # Concat the replacement of the matched pattern:
        out += vmg_to_unicode(result.group(0))
        index = result.end()

    # Concat the end of the unchanged string:
    out += in_string[index:]

    return out


if __name__ == '__main__':
    print('In  : "{}"'.format(test_string))
    print('Out : "{}"'.format(vmg_transform(test_string)))


Current output



In : "Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre"
Out : "Ce correspondant a cherch[0xC3A9] [0xC3A0] vous joindre"


Wanted output



In : "Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre"
Out : "Ce correspondant a cherché à vous joindre"

Answer
  • first create a bytearray
  • populate it
  • then convert to bytes and decode according to UTF-8 encoding
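
Taken in isolation, the three steps above could be sketched as follows. This is only an illustration that hard-codes the two hex strings 'C3' and 'A9' from the question's example:

```python
s = bytearray()                    # 1. create a bytearray

for hex_number in ('C3', 'A9'):    # 2. populate it, one byte per '=XX' group
    s.append(int(hex_number, 16))

text = bytes(s).decode('utf-8')    # 3. convert to bytes and decode as UTF-8
print(text)                        # é
```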

Here's the part of your code to adapt:

    s = bytearray()  # Collects the raw utf-8 bytes

    for i in range(0, in_length, 3): # For each '=XX' group:
        # Extract the hex number inside '=XX':
        hex_number = in_string[i+1:i+3]
        # Append the byte to the utf-8 sequence:
        s.append(int(hex_number, 16))

    # At this point, s contains for example bytearray(b'\xc3\xa9')

    # chr() is not needed: decoding the bytes as UTF-8 gives a string like 'é'
    return bytes(s).decode("utf-8")

Result:

In  : "Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre"
Out : "Ce correspondant a cherché à vous joindre"
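
As a possible variation on the same idea, the whole transformation can be written more compactly with re.sub() and bytes.fromhex(). This is a sketch rather than a drop-in replacement; the names decode_match and vmg_transform_compact are hypothetical, and it assumes every run of =XX groups forms valid UTF-8 (otherwise decode() raises UnicodeDecodeError).

```python
import re


def decode_match(match):
    # Strip the '=' signs, leaving plain hex digits such as 'C3A9'.
    hex_digits = match.group(0).replace('=', '')
    # Turn the hex digits into bytes, then decode those bytes as UTF-8.
    return bytes.fromhex(hex_digits).decode('utf-8')


def vmg_transform_compact(in_string):
    # Replace each run of '=XX' groups with its decoded text.
    return re.sub(r'(?:=[0-9A-Fa-f]{2})+', decode_match, in_string)


test_string = 'Ce correspondant a cherch=C3=A9 =C3=A0 vous joindre'
print(vmg_transform_compact(test_string))
# Ce correspondant a cherché à vous joindre
```

If the input is in fact quoted-printable text, which this =XX syntax resembles, the standard library's quopri.decodestring() followed by .decode('utf-8') may be worth trying as well.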