Xavier C. - 2 months ago
Python Question

How to handle unknown encoding

I'm having some issues with a Python script that needs to open files with different encodings.

I usually use this:

with open(path_to_file, 'r') as f:
    first_line = f.readline()


And that works great when the file is properly encoded.

But sometimes it doesn't work. For example, with this file I get:

In [22]: with codecs.open(filename, 'r') as f:
   ....:     a = f.readline()
   ....:     print(a)
   ....:     print(repr(a))
   ....:
��Test for StackOverlow

'\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'


I would like to search for text in those lines. Sadly, with that method I can't:

In [24]: "Test" in a
Out[24]: False


I've found a lot of questions here referring to the same type of issues:


  1. Unicode (utf8) reading and writing to files in python

  2. UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte

  3. http://programmers.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file

  4. how can i escape '\xff\xfe' to a readable string



But I can't manage to decode the file properly with any of them.

With codecs.open():

In [17]: with codecs.open(filename, 'r', "utf-8") as f:
   ....:     a = f.readline()
   ....:     print(a)
   ....:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-17-0e72208eaac2> in <module>()
1 with codecs.open(filename, 'r', "utf-8") as f:
----> 2 a = f.readline()
3 print(a)
4

/usr/lib/python2.7/codecs.pyc in readline(self, size)
688 def readline(self, size=None):
689
--> 690 return self.reader.readline(size)
691
692 def readlines(self, sizehint=None):

/usr/lib/python2.7/codecs.pyc in readline(self, size, keepends)
543 # If size is given, we call read() only once
544 while True:
--> 545 data = self.read(readsize, firstline=True)
546 if data:
547 # If we're at a "\r" read one extra character (which might

/usr/lib/python2.7/codecs.pyc in read(self, size, chars, firstline)
490 data = self.bytebuffer + newdata
491 try:
--> 492 newchars, decodedbytes = self.decode(data, self.errors)
493 except UnicodeDecodeError, exc:
494 if firstline:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte


With encode('utf-8'):

In [18]: with codecs.open(filename, 'r') as f:
   ....:     a = f.readline()
   ....:     print(a)
   ....:     a.encode('utf-8')
   ....:     print(a)
....:
��Test for StackOverlow

---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-18-7facc05b9cb1> in <module>()
2 a = f.readline()
3 print(a)
----> 4 a.encode('utf-8')
5 print(a)
6

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)


I've found a way to change file encoding automatically with Vim:

system("vim '+set fileencoding=utf-8' '+wq' %s" % path_to_file)


But I would like to do this without using Vim...

Any help will be appreciated.

Answer

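It looks like this is UTF-16-LE (UTF-16 little endian), but you are missing a final \x00. Reading the file as raw bytes splits the line at the \n byte, so the second byte of the two-byte UTF-16 newline (\n\x00) is left behind: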

>>> s = '\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'
>>> s.decode('utf-16-le') # creates error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\encodings\utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 46: truncated data
>>> (s+"\x00").decode("utf-16-le") # TADA!!!!
u'\ufeffTest for StackOverlow\r\n'
>>>
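
Rather than patching each line with an extra \x00, it may be simpler to let Python decode the file while reading it, so that lines are split on characters instead of bytes. The following is only a sketch under a few assumptions: the file starts with a byte order mark (as the \xff\xfe in the question suggests), and files without one can be treated as UTF-8. The helper names guess_encoding and read_first_line are made up for this example, and the temporary file just recreates the bytes from the question. io.open is used because it behaves the same on Python 2.7 and Python 3.

```python
import codecs
import io
import os
import tempfile

# BOMs to check. UTF-32 comes first because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(path, default="utf-8"):
    """Hypothetical helper: choose a codec from the file's BOM, else `default`."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    # Assumption: files without a BOM are treated as `default`.
    return default


def read_first_line(path):
    """Decode while reading, so readline() splits on characters, not bytes."""
    with io.open(path, "r", encoding=guess_encoding(path)) as f:
        return f.readline()


# Demo: recreate the bytes from the question in a temporary file.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + u"Test for StackOverlow\r\n".encode("utf-16-le"))

first_line = read_first_line(sample_path)
found = u"Test" in first_line
os.remove(sample_path)

print(repr(first_line))
print(found)
```

The "utf-16", "utf-32" and "utf-8-sig" codecs consume the BOM themselves, so the decoded line does not start with \ufeff, and the substring search from the question succeeds. This only helps for files that carry a BOM; for anything else the encoding still has to be known or guessed by other means.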