user7001260 user7001260 - 15 days ago 6
Python Question

String encode/decode issue - missing character from end

I have an NVARCHAR column in my database, and I am unable to convert the content of this column to a plain string in my code. (I am using pyodbc for the database connection.)

# This unicode string is returned by the database
>>> my_string = u'\u4157\u4347\u6e65\u6574\u2d72\u3430\u3931\u3530\u3731\u3539\u3533\u3631\u3630\u3530\u3330\u322d\u3130\u3036\u3036\u3135\u3432\u3538\u2d37\u3134\u3039\u352d'

# prints something in Chinese
>>> print my_string
䅗䍇湥整⵲㐰㤱㔰㜱㔹㔳㘱㘰㔰㌰㈭㄰〶〶ㄵ㐲㔸ⴷㄴ〹㔭


The closest I have come is by encoding it to utf-16:

>>> my_string.encode('utf-16')
'\xff\xfeWAGCenter-04190517953516060503-20160605124857-4190-5'
>>> print my_string.encode('utf-16')
��WAGCenter-04190517953516060503-20160605124857-4190-5


But the value I actually need, as stored in the database, is:

WAGCenter-04190517953516060503-20160605124857-4190-51


I tried encoding it to utf-8, utf-16, ascii, and utf-32, but nothing seemed to work.

Does anyone have an idea of what I am missing, and how to get the desired result from my_string?

Edit: On converting it to utf-16-le, I am able to remove the unwanted characters from the start, but one character is still missing from the end:


>>> print my_string.encode('utf-16-le')
WAGCenter-04190517953516060503-20160605124857-4190-5


It works for some other columns. What might be the cause of this intermittent issue?

Answer

You have a major problem in your database definition, in the way you store values in it, or in the way you read values from it. I can explain what you are seeing, but I cannot say why it happens or how to fix it without knowing:

  • the type of the database
  • the way you input values in it
  • the way you extract values to obtain your pseudo unicode string
  • the actual content if you use direct (native) database access

What you get is an ASCII string whose 8-bit characters have been grouped in pairs to build 16-bit unicode characters in little-endian order. As the expected string has an odd number of characters (53), the last character was (irremediably) lost in translation: the unicode string ends with u'\u352d', where 0x2d is the ASCII code for '-' and 0x35 the code for '5', so the final '1' had no second byte to pair with and was dropped. Demo:

def cvt(ustring):
    l = []
    for uc in ustring:
        l.append(chr(ord(uc) & 0xFF)) # low order byte
        l.append(chr((ord(uc) >> 8) & 0xFF)) # high order byte
    return ''.join(l)

cvt(my_string)
'WAGCenter-04190517953516060503-20160605124857-4190-5'
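The demo above is Python 2. As a rough Python 3 sketch of the same idea, the snippet below reproduces the misreading in the forward direction, assuming the stored value is pure ASCII and that its bytes were reinterpreted as UTF-16 little-endian somewhere between the database and the client. That cause is an inference from the observed output, not something confirmed in the question, and the variable names are illustrative only.

```python
# Python 3 sketch. Assumption: the ASCII bytes of the stored value were
# reinterpreted as UTF-16-LE code units somewhere on the way to the client.
expected = 'WAGCenter-04190517953516060503-20160605124857-4190-51'
raw = expected.encode('ascii')              # 53 single-byte characters

# UTF-16 needs two bytes per code unit, so only an even-length prefix
# can be decoded; the odd trailing byte ('1') has no partner.
usable = raw[:len(raw) - len(raw) % 2]      # 52 bytes
misread = usable.decode('utf-16-le')        # 26 CJK-looking characters

# Reversing the misreading recovers everything except the dropped byte.
recovered = misread.encode('utf-16-le').decode('ascii')

print(len(raw), len(usable), len(misread))
print(recovered)
```

Under that assumption, values with an even number of characters would survive the round trip intact, which might explain why other columns appear to work.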