Caumons Caumons - 4 months ago 14
Python Question

Python str vs unicode types

Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason, apart from being able to set Unicode code points in unicode strings using the escape char \?

Executing a module with:

# -*- coding: utf-8 -*-

a = 'á'
ua = u'á'
print a, ua


Results in: á á

EDIT:

More testing using Python shell:

>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'


So, the unicode string seems to be encoded using latin1 instead of utf-8, and the raw string is encoded using utf-8? I'm even more confused now! :S

Answer

unicode, which corresponds to Python 3's str, is meant to handle text. Text is a sequence of code points, each of which may need more than a single byte to store. Text can be encoded in a specific encoding (e.g. utf-8, latin-1...) to represent it as raw bytes. Note that unicode is not encoded! The internal representation used by Python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
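As a minimal sketch of the "unicode is not encoded" point: the snippet below uses escape sequences and explicit u''/b'' prefixes so that it should run unchanged on Python 2.7 and on Python 3.3+. The variable names are only illustrative.

```python
# u'\xe1' is an escape for the code point U+00E1 (á), not a Latin-1 byte.
ua = u'\xe1'

code_point = ord(ua)                  # the code point as an integer
utf8_bytes = ua.encode('utf-8')       # two bytes in UTF-8
latin1_bytes = ua.encode('latin-1')   # one byte in Latin-1

# Latin-1 maps code points 0-255 to the byte with the same value, which is
# why the repr of the unicode string merely looks like its Latin-1 encoding.
same_value = (latin1_bytes == b'\xe1')
```

This is probably what caused the confusion in the question's shell session: u'\xe1' shows a code point, and it only happens to coincide with the Latin-1 byte for that character.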

In contrast, str in Python 2 is a plain sequence of bytes. It does not represent text! In fact, in Python 3 this type is called bytes.

You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
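A short sketch of that idea, assuming three arbitrarily chosen encodings (utf-8, latin-1 and utf-16-le). It is written with escapes so it should run on both Python 2.7 and Python 3.3+.

```python
text = u'\xe0\xe8'  # "àè" as two code points

# The same text yields different byte sequences under different encodings.
encoded = {name: text.encode(name) for name in ('utf-8', 'latin-1', 'utf-16-le')}

# Decoding with the matching encoding recovers the original text each time.
round_trips = all(data.decode(name) == text for name, data in encoded.items())
```

The byte sequences differ in both length and content, but each decodes back to the same two code points.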

Some differences that you can see:

>>> len(u'à')  # a single code point
1
>>> len('à')   # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1'))  # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8')  # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�

Note that using str gives you lower-level control over the individual bytes of a specific encoded representation, while using unicode you can only operate at the code-point level. For example you can do:

>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù

What was valid UTF-8 before isn't anymore. Using a unicode string you cannot operate in such a way that the resulting string isn't valid Unicode text. You can remove a code point, replace a code point with a different code point, etc., but you cannot mess with the internal representation.
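The contrast can be checked programmatically. This is only a sketch: it uses a shorter sample string than the one above and escape sequences, so that it should run on both Python 2.7 and Python 3.3+.

```python
utf8_bytes = u'\xe0\xe8\xec'.encode('utf-8')  # "àèì" -> c3 a0 c3 a8 c3 ac

# Byte-level edit: dropping the single byte 0xA8 leaves a dangling 0xC3 lead byte.
broken = utf8_bytes.replace(b'\xa8', b'')
try:
    broken.decode('utf-8')
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# Code-point-level edit: removing è from the decoded text keeps it well formed.
text = utf8_bytes.decode('utf-8').replace(u'\xe8', u'')
reencoded = text.encode('utf-8')
```

Here the byte-level replacement produces data that no longer decodes as UTF-8, while the code-point-level replacement yields text that encodes cleanly again.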
