giaosudau - 7 months ago
Python: How to fix broken UTF-8 encoding?

My string is

Niệm Bồ Tát (Thiá»n sÆ° Nhất Hạnh)

and I want to decode it to
Niệm Bồ Tát (Thiền sư Nhất Hạnh)

I have seen a site that can do this, so I tried it in Python:

mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'

but the result is not correct: the original string is UTF-8, yet the string that is displayed is not what I expect.

Note: the text is Vietnamese.

How can I resolve this? Is it Windows Unicode or something? How can I detect the encoding here?
Thanks in advance


I'm not sure what you can do with this kind of data in general, but for the example in your original post, this works (Python 2):

>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
>>> s
u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
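
The trick is that the UTF-8 bytes were at some point decoded as Latin-1, so re-encoding as Latin-1 recovers the original bytes, which can then be decoded as UTF-8. In Python 3, where `str` is already Unicode, the same round trip might look like the sketch below. `fix_mojibake` is a hypothetical helper name of mine, and Latin-1 as the wrong encoding is an assumption (cp1252 is another common culprit), so treat this as a starting point rather than a general fix:

```python
def fix_mojibake(text, wrong_encoding="latin-1"):
    """Undo UTF-8 text that was mistakenly decoded as `wrong_encoding`.

    Hypothetical helper; the default wrong encoding is an assumption.
    """
    try:
        # Get back the original bytes, then decode them as the UTF-8 they are.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so this is not (cleanly) that kind of
        # mojibake; return the text unchanged.
        return text


# Reproduce the broken string from the question: UTF-8 bytes read as Latin-1.
broken = "09. Bát Nhã Tâm Kinh".encode("utf-8").decode("latin-1")

print(broken)
print(fix_mojibake(broken))
```

Strings that are already correct, or only partly garbled like the first example in the question, fail the round trip and come back unchanged, so those would need to be handled piece by piece.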