Stevanus Iskandar - 29 days ago
Python Question

What is a unicode string?

What exactly is a unicode string?

What's the difference between a regular string and unicode string?

What is utf-8?

I'm trying to learn Python right now, and I keep hearing this buzzword. What does the code below do?

i18n Strings (Unicode)

> ustring = u'A unicode \u018e string \xf1'
> ustring
u'A unicode \u018e string \xf1'

## (ustring from above contains a unicode string)
> s = ustring.encode('utf-8')
> s
'A unicode \xc6\x8e string \xc3\xb1' ## bytes of utf-8 encoding
> t = unicode(s, 'utf-8') ## Convert bytes back to a unicode string
> t == ustring ## It's the same as the original, yay!
True


Files Unicode

import codecs

f = codecs.open('foo.txt', 'rU', 'utf-8')
for line in f:
    # here line is a *unicode* string
    print repr(line)
f.close()

tom
Answer

This answer is about Python 2. In Python 3, str is a Unicode string.

Python 2's str type is a sequence of 8-bit characters (bytes). The English alphabet fits in these 8-bit characters because it is covered by ASCII, but symbols such as ±, ♠, Ω and ℑ fall outside ASCII and have no single agreed-upon one-byte representation.

Unicode is a standard for working with a wide range of characters. Each symbol has a codepoint (a number), and these codepoints can be encoded (converted to a sequence of bytes) using a variety of encodings.
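As a small sketch of the codepoint idea, the built-in ord() returns the codepoint of a single character. The u'' literals below behave the same way in Python 2 and in Python 3.3 or later; the variable names are only illustrative.

```python
# Each Unicode character maps to a number, its codepoint.
cp_a = ord(u'A')        # a plain ASCII letter
cp_e = ord(u'\u018e')   # the character written as \u018e in the question
cp_n = ord(u'\xf1')     # the character written as \xf1 in the question

# Show the codepoints in hexadecimal, the notation used by the escapes.
print('%#x %#x %#x' % (cp_a, cp_e, cp_n))  # prints: 0x41 0x18e 0xf1
```

The hexadecimal values match the digits in the \u018e and \xf1 escapes, which is all those escapes are: a way of spelling a codepoint.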

UTF-8 is one such encoding. Codepoints up to 127 (the ASCII range) are encoded using a single byte, and higher codepoints are encoded as sequences of two to four bytes.
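To see the variable-width behaviour, here is a rough sketch that encodes three characters with increasingly large codepoints and counts the resulting bytes. The spade symbol (U+2660) is my own choice of a three-byte example and is not in the question's code.

```python
# Encode single characters and look at how many bytes each one needs.
one_byte = u'A'.encode('utf-8')            # U+0041, ASCII range
two_bytes = u'\xf1'.encode('utf-8')        # U+00F1
three_bytes = u'\u2660'.encode('utf-8')    # U+2660, the spade symbol

print('%d %d %d' % (len(one_byte), len(two_bytes), len(three_bytes)))  # prints: 1 2 3
```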

Python's unicode type is a collection of codepoints. The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.

When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are not in the standard printable range.

The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each codepoint to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high codepoints and are encoded as a sequence of two bytes rather than a single byte.
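The two sizes mentioned above can be checked directly with len(). This is a minimal sketch that runs on Python 2 and Python 3; note that in Python 3 the encoded value is a bytes object rather than a str.

```python
ustring = u'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

n_chars = len(ustring)   # counts codepoints
n_bytes = len(s)         # counts bytes after encoding

print('%d characters, %d bytes' % (n_chars, n_bytes))  # prints: 20 characters, 22 bytes
```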

When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.
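Indexing makes the same point: one position in the Unicode string corresponds to two positions in the encoded string. A short sketch, with the index 10 worked out by counting the characters in 'A unicode ':

```python
ustring = u'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

# In the Unicode string, position 10 holds one character (U+018E).
char = ustring[10]

# In the encoded string, that character occupies positions 10 and 11.
raw = s[10:12]

print(repr(char))
print(repr(raw))
```

The exact repr output differs between Python 2 and Python 3, so I have not listed it here.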

The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original codepoints by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
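Here is a sketch of the round trip using s.decode('utf-8'), which does the same job as unicode(s, 'utf-8') in Python 2 and also works in Python 3. The Latin-1 decode at the end is my own addition, to show what happens when the wrong encoding is assumed.

```python
ustring = u'A unicode \u018e string \xf1'
s = ustring.encode('utf-8')

# Decoding with the codec that was used for encoding restores the text.
t = s.decode('utf-8')
round_trip_ok = (t == ustring)

# Latin-1 maps every byte to its own character, so the two-byte
# sequences are not recombined and the result is different text.
wrong = s.decode('latin-1')

print(round_trip_ok)   # prints: True
print(len(wrong))      # prints: 22
```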

The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the content of the file (which is a collection of bytes) as a Unicode string that has been encoded using UTF-8.
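For a self-contained version of the file example, here is a sketch that writes a small UTF-8 file and reads it back. It uses io.open rather than the codecs.open call from the question, because io.open takes an encoding argument on both Python 2.6+ and Python 3; the temporary file is just a stand-in for foo.txt.

```python
import io
import os
import tempfile

text = u'A unicode \u018e string \xf1\n'

# Create a temporary file and write the text to it as UTF-8 bytes.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read it back; each line arrives already decoded to a Unicode string.
with io.open(path, 'r', encoding='utf-8') as f:
    lines = [line for line in f]

os.remove(path)
print(lines == [text])  # prints: True
```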