Lin Ma Lin Ma - 4 months ago 28
Python Question

print UTF-8 character in Python 2.7

Here is how I open, read and output. The file is an UTF-8 encoded file for unicode characters. I want to print the first 10 UTF-8 characters, but the output from below code snippet print 10 weird unrecognized characters. Wondering if anyone have any ideas how to print correctly? Thanks.

with open(name, 'r') as content_file:
content = content_file.read()
for i in range(10):
print content[i]


Each of the 10 weird character looks like this,




regards,
Lin

Answer

When Unicode codepoints (characters) are encoded as UTF-8 some codepoints are converted to a single byte, but many codepoints become more than one byte. Characters in the standard 7 bit ASCII range will be encoded as single bytes, but more exotic characters will generally require more bytes to encode.

So you are getting those weird characters because you are breaking up those multi-byte UTF-8 sequences into single bytes. Sometime those bytes will correspond to normal printable characters, but often they won't so you get � printed instead.

Here's a short demo using the ©, ®, and ™ characters, which are encoded as 2, 2, and 3 bytes respectively in UTF-8. My terminal is set to use UTF-8.

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
print utfbytes, len(utfbytes)
for b in utfbytes:
    print b, repr(b)

uni = utfbytes.decode('utf-8')
print uni, len(uni)

output

© ® ™ 9                                                                                                                                        
� '\xc2'                                                                                                                                       
� '\xa9'                                                                                                                                       
  ' '
� '\xc2'
� '\xae'
  ' '
� '\xe2'
� '\x84'
� '\xa2'
© ® ™ 5

Stack Overflow co-founder, Joel Spolsky, has written a good article on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

You should also take a look at the Unicode HOWTO article in the Python docs, and Ned Batchelder's Pragmatic Unicode article, aka "Unipain".


Here's a short example of extracting individual characters from a UTF-8 encoded byte string. As I mention in the comments, to do this correctly you need to know how many bytes each of the characters is encoded as.

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    print "%d %d [%s]" % (start, w, utfbytes[start:start+w])
    start += w

output

0 2 [©]
2 1 [ ]
3 2 [®]
5 1 [ ]
6 3 [™]

FWIW, here's a Python 3 version of that code:

utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    s = utfbytes[start:start+w]
    print("%d %d [%s]" % (start, w, s.decode()))
    start += w

If we don't know the byte widths of the characters in our UTF-8 string then we need to do a little more work. Each UTF-8 sequence encodes the width of the sequence in the first byte, as described in the Wikipedia article on UTF-8.

The following Python 2 demo shows how you can extract that width information; it produces the same output as the two previous snippets.

# UTF-8 code widths
#width starting byte
#1 0xxxxxxx
#2 110xxxxx
#3 1110xxxx
#4 11110xxx
#C 10xxxxxx

def get_width(b):
    if b <= '\x7f':
        return 1
    elif '\x80' <= b <= '\xbf':
        #Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif '\xc0' <= b <= '\xdf':
        return 2
    elif '\xe0' <= b <= '\xef':
        return 3
    elif '\xf0' <= b <= '\xf7':
        return 4
    else:
        raise ValueError('%r is not a single byte' % b)


utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
start = 0
while start < len(utfbytes):
    b = utfbytes[start]
    w = get_width(b)
    s = utfbytes[start:start+w]
    print "%d %d [%s]" % (start, w, s)
    start += w

Generally, it should not be necessary to do this sort of thing: just use the provided decoding methods.