Younghak Jang - 2 months ago
Python Question

Python: for-iteration through a utf-8 string -> what's the data type/encoding of the iterators?

I have UTF-8 encoded strings (mainly Chinese plus some English) and want to run a letter count on them (similar to an English word count).

So I used

for letter in text: # text is a utf-8 encoded str

but I'm not sure what I'm getting as 'letter'. 'text' prints fine on screen and writes fine to CSV, but the 'letter' in 'for letter in text' looks garbled both on screen and in the CSV file. I think it's definitely an encoding problem, but adding .encode('utf-8') here and there doesn't solve it and instead raises errors like

UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 0: ordinal not in range(128)

To be precise, the code below doesn't raise an error, but the letters look garbled; the error above appears when I add .encode('utf-8') in the final writerows call, like this:
wcwriter.writerows([[k.encode('utf-8'), v]])

# -*- coding: utf-8 -*-

with open(fname+'.csv', 'wb') as twfile:
    twwriter = csv.writer(twfile)
    twwriter.writerows([[u'Date/Time', u'Text', u'ID', u'Location', u'City', u'Province']])

    for statuses in jres.get('statuses'): # jres is a json format response returned from a API call request
        text = statuses.get('text').encode('utf-8')

        if keyword in text:
            td = statuses.get('created_at').encode('utf-8')
            name = statuses.get('user').get('screen_name').encode('utf-8')
            loc = statuses.get('user').get('location').encode('utf-8')
            city = statuses.get('user').get('city').encode('utf-8')
            province = statuses.get('user').get('province').encode('utf-8')

            twwriter.writerows([[td, text, name, loc, city, province]])
            keycount += 1

            # this is the problematic part. I'm not sure exactly what data type or encoding I'm getting for 'letter' below

            for letter in text:
                if letter not in dismiss:
                    print letter # this will print a lot of crushed letters
                    if letter not in wordcount:
                        wordcount[letter] = 1
                    wordcount[letter] += 1

with open(wcname+'.csv', 'wb') as wcfile:
    wcwriter = csv.writer(wcfile)
    wcwriter.writerows([[u'Letter', u'Number']])

    for k, v in wordcount.items():
        wcwriter.writerows([[k, v]])


UTF-8 encoded bytes may print fine to the screen or write fine to a file, but that's only because both your screen (terminal or console) and whatever is reading the file also understand UTF-8.

UTF-8 encoding uses one or more bytes per codepoint. Iterating over a byte string doesn't step through the data codepoint by codepoint but byte by byte. So the character 'å' is encoded to UTF-8 as two bytes, C3 and A5. Trying to handle those two bytes as separate letters is going to create problems:

>>> 'å'
'\xc3\xa5'
>>> for byte in 'å':
...     print repr(byte)
'\xc3'
'\xa5'
Instead, decode to unicode values so that Python knows about the codepoints that the bytes encode, or, where you already have Unicode, don't encode in the first place:

>>> for codepoint in 'å'.decode('utf8'):
...     print repr(codepoint), codepoint
u'\xe5' å
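
Applied to the letter count in the question, the same idea might look like the sketch below. It is written to run on both Python 2.7 and Python 3; the names count_letters, raw and the dismiss argument are illustrative assumptions, not part of the original code.

```python
# -*- coding: utf-8 -*-
from collections import Counter


def count_letters(raw, dismiss=frozenset()):
    """Count codepoints in UTF-8 encoded bytes, skipping any in `dismiss`.

    `raw` is assumed to be UTF-8 encoded bytes (a Python 2 str);
    `dismiss` is a hypothetical set of Unicode characters to ignore.
    """
    text = raw.decode('utf-8')  # bytes -> Unicode: one item per codepoint
    return Counter(ch for ch in text if ch not in dismiss)


raw = u'中文 text 中'.encode('utf-8')      # 15 bytes for 9 codepoints
counts = count_letters(raw, dismiss={u' '})
```

Because the loop now runs over Unicode, each Chinese character is counted once instead of being split into its three UTF-8 bytes.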

Your exception is caused when you try to encode already encoded bytes. Python tries to be helpful by first decoding the bytes to Unicode so that it can comply and encode back to bytes, but it can only do so with the default ASCII encoding. That is why you get a UnicodeDecodeError (note the Decode in there) when trying to use encode():

>>> 'å'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
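
The hidden step can be reproduced explicitly. The following is a small sketch (it runs on Python 2.7 and Python 3) of the ASCII decode that Python 2 attempts behind the scenes before it can encode a byte string; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
encoded = u'å'.encode('utf-8')  # the two bytes C3 A5

try:
    # This is the decode Python 2 performs implicitly when you call
    # .encode() on a byte string; here it is spelled out.
    encoded.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the codec the bytes were actually written in works fine.
roundtrip = encoded.decode('utf-8')
```

The failure comes from the decode, not the encode, which is why the exception type is UnicodeDecodeError.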

Generally speaking, you want to treat text as Unicode as much as possible. Implement a Unicode sandwich: decode from bytes to Unicode as early as possible, and encode only as late as possible, when you are writing your data back out to a file. The JSON data you are processing is already Unicode, so you only need to encode to UTF-8 when producing your CSV rows, not earlier.

In this case that means you should not encode text:

for statuses in jres.get('statuses'): # jres is a json format response returned from a API call request
    text = statuses['text']

and instead only encode it when you are passing it to the CSV writer:

twwriter.writerows([[td, text.encode('utf8'), name, loc, city, province]])
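
If several columns hold Unicode, encoding each cell by hand gets repetitive. One possible way to keep the encoding at the output boundary is a small helper like the sketch below; encode_row and text_type are hypothetical names introduced here, and the approach assumes Python 2, where csv.writer expects byte strings (on Python 3 you would instead open the file in text mode with an encoding and pass the strings through unchanged).

```python
# -*- coding: utf-8 -*-
text_type = type(u'')  # unicode on Python 2, str on Python 3


def encode_row(row, encoding='utf-8'):
    """Return a copy of `row` with every Unicode cell encoded to bytes.

    Non-text cells (such as integer counts) are passed through untouched.
    """
    return [cell.encode(encoding) if isinstance(cell, text_type) else cell
            for cell in row]


row = encode_row([u'中', 2])
```

With the letter counts keyed by Unicode characters, the final loop could then call wcwriter.writerow(encode_row([k, v])).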

You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python: