tntu - 5 months ago
Linux Question

UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c

I have a socket server that is supposed to receive valid UTF-8 characters from clients.

The problem is that some clients (mainly hackers) are sending all kinds of invalid data over it.

I can easily distinguish the genuine client, but I am logging all the data sent to files so I can analyze it later.

Sometimes I get characters like œ that cause a UnicodeDecodeError.

I need to be able to make the string UTF-8 with or without those characters.

Answer

http://docs.python.org/howto/unicode.html#the-unicode-type

text = unicode(raw, 'utf-8', errors='replace')

or

text = unicode(raw, 'utf-8', errors='ignore')

Note: errors='ignore' strips the undecodable bytes and returns the string without them, while errors='replace' substitutes U+FFFD (the replacement character) for each one. Use 'ignore' only if you need the bad bytes dropped rather than marked.
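The difference between the handlers is easiest to see on a sample byte string. This is a minimal sketch in Python 3 syntax, where bytes.decode() takes the same errors argument. The payload is made up for illustration. backslashreplace is an extra handler not mentioned in the answer (usable for decoding since Python 3.5); it keeps the bad byte visible as an escape, which may suit logs meant for later analysis.

```python
# Hypothetical payload: 0x9c on its own is not a valid UTF-8 sequence.
raw = b"hello \x9c world"

replaced = raw.decode("utf-8", errors="replace")          # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")            # bad byte is dropped
escaped = raw.decode("utf-8", errors="backslashreplace")  # bad byte becomes the text \x9c

print(repr(replaced))
print(repr(ignored))
print(repr(escaped))
```

All three calls return a normal str, so none of them raises UnicodeDecodeError.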

For Python 3:

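While reading the file (this needs import codecs; the built-in open() accepts the same encoding and errors arguments):

with codecs.open(file_name, "r", encoding='utf-8', errors='ignore') as fdata:
    data = fdata.read()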
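For a complete round trip, here is a hedged sketch using the built-in open() instead of codecs.open. The temporary file and its contents are invented for the example, and errors="replace" is used so that the bad byte stays visible as U+FFFD rather than disappearing.

```python
import os
import tempfile

# Hypothetical log file containing one byte (0x9c) that is not valid UTF-8.
fd, file_name = tempfile.mkstemp(suffix=".log")
with os.fdopen(fd, "wb") as raw_log:
    raw_log.write(b"user=alice\npayload=\x9c\n")

# The built-in open() applies the error handler while decoding each read.
with open(file_name, "r", encoding="utf-8", errors="replace") as fdata:
    lines = fdata.read().splitlines()

os.remove(file_name)
print(lines)
```

Swapping in errors="ignore" would drop the byte instead, leaving the second line as just "payload=".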