Ivri Ivri - 2 months ago 9
Python Question

Can't decode a cyrillic string in python

I have an encoded file with strings like

b'1' b'\xca\xee\xef\xe5\xe9\xf1\xea' b'1' b'ADMIN' b'2013-07-08 00:21:55'
b'2' b'\xd7\xe5\xeb\xff\xe1\xe8\xed\xf1\xea' b'1' b'ADMIN' b'2013-07-08 00:22:05'


How should I decode it? I tried to use codecs, decode/encode cp1251, it didn't work.
file -bi says charset=us-ascii

There should be astring in cyrillic(cp1251) actually

python 2.7

The output:

>>> w=r'\xd7\xe5\xe\xff\xe1\xe8\xed\xf1\xea'
>>> w='\xd7\xe5\xe\xff\xe1\xe8\xed\xf1\xea'
ValueError: invalid \x escape
>>> w=r'\xd7\xe5\xe\xff\xe1\xe8\xed\xf1\xea'
>>> w.decode('raw_unicode_escape')
u'\\xd7\\xe5\\xe\\xff\\xe1\\xe8\\xed\\xf1\\xea'
>>> w.decode('utf-8')
u'\\xd7\\xe5\\xe\\xff\\xe1\\xe8\\xed\\xf1\\xea'
>>> unicode(w)
u'\\xd7\\xe5\\xe\\xff\\xe1\\xe8\\xed\\xf1\\xea'
>>> unicode(w, 'utf-8')
u'\\xd7\\xe5\\xe\\xff\\xe1\\xe8\\xed\\xf1\\xea'


I did everything: decode("utf-8"), used unicode and so, but nothing changes, everytime I get the same set of bytes.

Answer

The issue is you are missing a b after the 3rd \x escape in your w variable when it says invalid escape.

>>> w = '\xd7\xe5\xeb\xff\xe1\xe8\xed\xf1\xea'
>>> w.decode('cp1251')
u'\u0427\u0435\u043b\u044f\u0431\u0438\u043d\u0441\u043a'