Here is how I open, read, and print the file. The file is UTF-8 encoded and contains Unicode characters. I want to print the first 10 characters, but the code snippet below prints 10 weird, unrecognized characters instead. Does anyone have any idea how to print them correctly? Thanks.
    with open(name, 'r') as content_file:
        content = content_file.read()
    for i in range(10):
        print content[i]
When Unicode codepoints (characters) are encoded as UTF-8 some codepoints are converted to a single byte, but many codepoints become more than one byte. Characters in the standard 7 bit ASCII range will be encoded as single bytes, but more exotic characters will generally require more bytes to encode.
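To make the variable-width point concrete, here is a minimal Python 3 sketch that reports how many bytes a few sample characters occupy once encoded. The helper name `utf8_width` and the choice of sample characters are mine, purely for illustration.

```python
# Python 3 sketch: how many bytes each code point needs in UTF-8.
# The samples are A, the copyright sign, the trademark sign, and an emoji.
samples = ["A", "\u00a9", "\u2122", "\U0001F600"]


def utf8_width(ch):
    """Return the number of bytes used to encode one character as UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    # Print the code point and the raw bytes rather than the character itself,
    # so the output does not depend on the terminal's encoding.
    print("U+%04X -> %d byte(s): %r" % (ord(ch), utf8_width(ch), ch.encode("utf-8")))
```

The four samples need 1, 2, 3, and 4 bytes respectively; only the ASCII letter fits in a single byte.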
So you are getting those weird characters because you are breaking up those multi-byte UTF-8 sequences into single bytes. Sometimes those bytes will correspond to normal printable characters, but often they won't, so you get � printed instead.
Here's a short Python 2 demo using the ©, ®, and ™ characters, which are encoded as 2, 2, and 3 bytes respectively in UTF-8. My terminal is set to use UTF-8.
    utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
    print utfbytes, len(utfbytes)
    for b in utfbytes:
        print b, repr(b)
    uni = utfbytes.decode('utf-8')
    print uni, len(uni)
    © ® ™ 9
    � '\xc2'
    � '\xa9'
      ' '
    � '\xc2'
    � '\xae'
      ' '
    � '\xe2'
    � '\x84'
    � '\xa2'
    © ® ™ 5
Stack Overflow co-founder Joel Spolsky has written a good article on Unicode: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
Here's a short example of extracting individual characters from a UTF-8 encoded byte string. As I mention in the comments, to do this correctly you need to know how many bytes each of the characters is encoded as.
    utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
    widths = (2, 1, 2, 1, 3)
    start = 0
    for w in widths:
        print "%d %d [%s]" % (start, w, utfbytes[start:start+w])
        start += w
    0 2 [©]
    2 1 [ ]
    3 2 [®]
    5 1 [ ]
    6 3 [™]
FWIW, here's a Python 3 version of that code:
    utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
    widths = (2, 1, 2, 1, 3)
    start = 0
    for w in widths:
        s = utfbytes[start:start+w]
        print("%d %d [%s]" % (start, w, s.decode()))
        start += w
If we don't know the byte widths of the characters in our UTF-8 string then we need to do a little more work. Each UTF-8 sequence encodes the width of the sequence in the first byte, as described in the Wikipedia article on UTF-8.
The following Python 2 demo shows how you can extract that width information; it produces the same output as the two previous snippets.
    # UTF-8 code widths
    # width  starting byte
    # 1      0xxxxxxx
    # 2      110xxxxx
    # 3      1110xxxx
    # 4      11110xxx
    # C      10xxxxxx  (continuation byte)

    def get_width(b):
        if b <= '\x7f':
            return 1
        elif '\x80' <= b <= '\xbf':
            # Continuation byte
            raise ValueError('Bad alignment: %r is a continuation byte' % b)
        elif '\xc0' <= b <= '\xdf':
            return 2
        elif '\xe0' <= b <= '\xef':
            return 3
        elif '\xf0' <= b <= '\xf7':
            return 4
        else:
            raise ValueError('%r is not a single byte' % b)

    utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
    start = 0
    while start < len(utfbytes):
        b = utfbytes[start]
        w = get_width(b)
        s = utfbytes[start:start+w]
        print "%d %d [%s]" % (start, w, s)
        start += w
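For completeness, here is a rough Python 3 sketch of the same idea. The main difference is that indexing a `bytes` object in Python 3 yields an integer rather than a one-character string, so the comparisons are done against numbers. The generator `split_chars` is a name I made up for this illustration. Like the Python 2 demo, it only inspects the lead byte; it relies on `.decode()` to reject malformed continuation bytes.

```python
def get_width(b):
    """Return the UTF-8 sequence length implied by lead byte b (an int, 0..255)."""
    if b <= 0x7F:
        return 1
    elif b <= 0xBF:
        # 0x80..0xBF: continuation byte, so we are not at the start of a sequence
        raise ValueError("Bad alignment: 0x%02x is a continuation byte" % b)
    elif b <= 0xDF:
        return 2
    elif b <= 0xEF:
        return 3
    elif b <= 0xF7:
        return 4
    else:
        raise ValueError("0x%02x is not a valid UTF-8 lead byte" % b)


def split_chars(utfbytes):
    """Yield (start, width, text) for each character in a UTF-8 byte string."""
    start = 0
    while start < len(utfbytes):
        w = get_width(utfbytes[start])  # indexing bytes gives an int in Python 3
        yield start, w, utfbytes[start:start + w].decode("utf-8")
        start += w


utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
chunks = list(split_chars(utfbytes))
for start, w, text in chunks:
    print("%d %d [%s]" % (start, w, text))
```

On a UTF-8 terminal this should print the same offsets, widths, and characters as the earlier snippets.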
Generally, it should not be necessary to do this sort of thing: just use the provided decoding methods.
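As a sketch of what "use the provided decoding methods" could look like for the original question: open the file in text mode with an explicit encoding, so that what you read is already decoded, and then take the first 10 characters. The function name `first_chars` and the temporary sample file are assumptions of mine to keep the example self-contained. `io.open` is used because it accepts `encoding` on both Python 2 and Python 3 (on Python 3 it is the same function as the built-in `open`).

```python
import io
import os
import tempfile


def first_chars(path, n=10, encoding="utf-8"):
    """Return the first n characters (code points) of a text file."""
    with io.open(path, "r", encoding=encoding) as content_file:
        # In text mode, read(n) counts decoded characters, not bytes.
        return content_file.read(n)


# Demo with a throwaway file so the sketch runs on its own.
sample = u"\u00a9 \u00ae \u2122 caf\u00e9 na\u00efve"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    head = first_chars(path)
    print(head)
finally:
    os.remove(path)
```

Here `head` holds 10 characters even though they occupy 15 bytes in the file, which is exactly the distinction that tripped up the byte-by-byte loop in the question.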