Jeremy Schutte - 1 year ago
Python Question

Converting double slash utf-8 encoding

I cannot get this to work! I have a text file, produced by a save-game file parser, that contains a bunch of Chinese names stored as escaped UTF-8 bytes. In source.txt they look like this:

\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89

But no matter how I import it into Python (3 or 2), I get this string at best:

'\\xe6\\x89\\x8e\\xe5\\x8a\\xa0\\xe6\\x8b\\x89'

As other threads have suggested, I have tried re-encoding the string as UTF-8 and then decoding it with unicode_escape, like so:

source.encode('utf-8').decode('unicode_escape')

But then it messes up the original encoding, and gives this as the string:

'æ\x89\x8eå\x8a\xa0æ\x8b\x89' (printing this string results in: æå æ )
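For what it's worth, that output is not random damage. Here is a minimal sketch of what seems to be happening, assuming the name in source.txt is the escaped text \xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89 (inferred from the result shown above; the variable names are only for illustration). The unicode_escape codec turns each \xNN into the code point U+00NN rather than the byte NN, so the UTF-8 bytes end up as Latin-1 characters. Encoding back to Latin-1 recovers the raw bytes, which can then be decoded as UTF-8:

```python
# Assumed file content: literal backslash-x escapes, not real bytes.
raw = r"\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89"

# The attempt from the question: each \xNN becomes the character U+00NN.
mojibake = raw.encode("utf-8").decode("unicode_escape")
# mojibake == 'æ\x89\x8eå\x8a\xa0æ\x8b\x89'

# Latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
# so this step gives back the original UTF-8 byte sequence.
fixed = mojibake.encode("latin-1").decode("utf-8")
# fixed == '扎加拉'
```

This only works if every character in the text is an escape or plain ASCII; real non-ASCII characters elsewhere in the string would break the round trip.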

Now, if I manually paste the original string after a b prefix (making it a bytes literal) and decode that, I get the correct text. For example:

b'\xe6\x89\x8e\xe5\x8a\xa0\xe6\x8b\x89'.decode('utf-8')

Results in: '扎加拉'

But I can't do this programmatically. I can't even get rid of the double backslashes.

To be clear, source.txt contains single backslashes. I have tried importing it in many ways, but this is the most common:

with open('source.txt','r',encoding='utf-8') as f_open:
    source = f_open.read()

Okay, so I accepted the answer below (I think), but here is what works:

from ast import literal_eval
decodedString = literal_eval("b'{}'".format(stringVariable)).decode('utf-8')

I can't use it on the whole file because of other encoding issues, but extracting each name as a string (stringVariable) and then doing that works! Thank you!
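If extracting each name by hand gets tedious, one possible alternative (a sketch, not what the answer below does) is to replace only the runs of \xNN escapes and leave the rest of the text alone. The names decode_escaped_runs and _ESCAPED_RUN are hypothetical, and the sketch assumes the escapes always have the form backslash, x, two hex digits:

```python
import re

# One or more consecutive \xNN escapes (literal backslash, 'x', two hex digits).
_ESCAPED_RUN = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")

def decode_escaped_runs(text):
    """Replace each run of \\xNN escapes in text with the UTF-8 string it encodes."""
    def _decode(match):
        # Drop the '\x' prefixes to leave plain hex digits, e.g. 'e6898e'.
        hex_digits = match.group(0).replace("\\x", "")
        # errors='replace' keeps going if a run is not valid UTF-8.
        return bytes.fromhex(hex_digits).decode("utf-8", errors="replace")
    return _ESCAPED_RUN.sub(_decode, text)
```

Because only the matched runs are touched, text that already contains real non-ASCII characters or quotes passes through unchanged.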

Answer

I'm assuming you're using Python 3. In Python 2, strings are bytes by default, so the escape sequences can be decoded in place (for example with the string_escape codec). In Python 3, strings are Unicode, so the text read from the file holds a literal backslash followed by x and two hex digits rather than the byte itself, which is what makes this problem harder.
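To make the distinction concrete, here is a small illustration (the variable names are made up for this example) of the difference between escaped text, which is what reading the file gives you, and real bytes:

```python
# What Python 3 reads from the file: 12 characters, starting with a backslash.
escaped_text = r"\xe6\x89\x8e"

# What the escapes are meant to represent: 3 raw bytes.
real_bytes = b"\xe6\x89\x8e"

text_length = len(escaped_text)      # 12
byte_length = len(real_bytes)        # 3
decoded = real_bytes.decode("utf-8") # '扎'
```

The whole problem is getting from the first form to the second.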

This solution was inspired by mgilson's answer. We can wrap the text in a bytes literal and evaluate it with literal_eval, which turns the escape sequences into real bytes that can then be decoded as UTF-8:

from ast import literal_eval

with open('source.txt', 'r', encoding='utf-8') as f_open:
    # Strip the trailing newline; a raw newline would break the bytes literal.
    source = f_open.read().strip()
    string = literal_eval("b'{}'".format(source)).decode('utf-8')
    print(string)  # 扎加拉
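One caveat with building a bytes literal from file content: a quote character, a raw newline, or any non-ASCII character makes the literal invalid, and literal_eval raises an exception. Here is a hedged sketch of a per-line wrapper (decode_escaped_line is a hypothetical helper, not part of the answer above) that falls back to the unchanged line when that happens:

```python
from ast import literal_eval

def decode_escaped_line(line):
    """Decode \\xNN escapes in one line; return the line unchanged on failure."""
    try:
        return literal_eval("b'{}'".format(line)).decode('utf-8')
    except (SyntaxError, ValueError):
        # SyntaxError: the line broke the bytes literal (quote, non-ASCII, bad escape).
        # ValueError: covers UnicodeDecodeError (bytes were not valid UTF-8) and
        # literal_eval rejecting anything that is not a plain literal.
        return line
```

Calling it on each line (with the newline stripped) instead of on the whole file means a single problematic line no longer stops everything else from being decoded.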