alejoss - 1 year ago
Python Question

Python how to solve Unicode Error in string

I'm getting the classic error:


'ascii' codec can't decode byte 0xc3 in position 28: ordinal not in range(128)


This time, I can't solve it. The error comes from this line:

mensaje_texto_inmobiliaria = "%s, con el email %s y el teléfono %s está se ha contactado con Inmobiliar" % (nombre, email, telefono)


Specifically, from the word `teléfono`. I have tried adding `# -*- coding: utf-8 -*-` to the beginning of the file, adding `unicode( <string> )` and also `<string>.encode("utf-8")`. Nothing worked. Any advice will help.

Answer Source

This explains why the two imports below solve the issue OP is having, with some background on the issue OP is trying to describe:

from __future__ import unicode_literals
from builtins import str

In the default IPython kernel on Python 2.7:

(iPython session)

In [1]: type("é") # By default, quotes in py2 create py2 `str` objects, which are sequences of bytes that can be decoded to text once you know the encoding.
Out[1]: str

In [2]: type("é".decode("utf-8")) # We can get to the actual text data by decoding it if we know what encoding it was initially encoded in; utf-8 is a safe guess in almost every country but Myanmar.
Out[2]: unicode

In [3]: len("é") # Note that the py2 `str` representation has a length of 2. In utf-8, é is encoded as the two bytes 0xc3 0xa9 (the 0xc3 from the error message is the first of them).
Out[3]: 2

In [4]: len("é".decode("utf-8")) # the py2 `unicode` representation has length 1, since an accented e is a single character
Out[4]: 1
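
As a minimal sketch of the same point outside IPython, the snippet below looks at the bytes behind é. It is written to run on both Python 2.7 and Python 3, and the variable names are illustrative rather than taken from the question.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

e_acute = "\u00e9"                  # é as text (py2 unicode / py3 str)
encoded = e_acute.encode("utf-8")   # the same character as utf-8 bytes

# bytearray yields integers on both Python 2 and Python 3
byte_values = [hex(b) for b in bytearray(encoded)]

print(len(e_acute), len(encoded))   # 1 2
print(byte_values)                  # ['0xc3', '0xa9']
```

Neither byte is an ascii character, which is why the ascii codec gives up as soon as it reaches 0xc3.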

Some other things of note in python 2.7:

  • "é" is the same thing as str("é")
  • u"é" is the same thing as "é".decode('utf-8') or unicode("é", 'utf-8')
  • u"é".encode('utf-8') is the same thing as str("é")
  • You typically call decode on a py2 str, and encode on a py2 unicode.
    • Due to early design issues, you can call both on either, even though that doesn't really make sense. Calling encode on a py2 str first decodes it implicitly with the ascii codec, which is why <string>.encode("utf-8") raises the same error on a byte string containing é.
    • In python3, str, which is the same as python2 unicode, can no longer be decoded, since a string is by definition a decoded sequence of bytes. In python3, str.encode and bytes.decode use the utf-8 encoding by default.
  • Byte sequences that contain only ascii characters behave almost exactly the same as their decoded counterparts.
    • In python 2.7 with no future imports: type("a".decode('ascii')) gives a unicode object, but this behaves nearly identically to str("a"). This is not the case in python3.
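
To connect this back to the error in the question: a likely cause (an assumption, since the post does not show where nombre, email and telefono come from) is that at least one of them is a unicode object. In that case Python 2 implicitly decodes the byte-string template with the ascii codec before formatting. The sketch below performs that decode explicitly, so it runs on Python 2.7 and Python 3 alike.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

template = "%s, con el email %s y el teléfono %s está se ha contactado con Inmobiliar"

# What a plain py2 "..." literal holds when the source file is utf-8
template_bytes = template.encode("utf-8")

try:
    # Python 2 does this step silently when a byte str meets a unicode object
    template_bytes.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error.encoding, error.start)  # ascii 28
```

The failing position is 28, the offset of the é in "teléfono", which matches the position reported in the question.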

With that said, here's what the two imports above do:

  • __future__ is a module maintained by the core python team that backports python3 functionality to python2 to allow you to use python3 idioms within python2.
  • from __future__ import unicode_literals has the following effect :
    • Without the future import "é" is the same thing as str("é")
    • With the future import "é" is functionally the same thing as u"é"
  • builtins is the python3 standard-library module of built-in names; on python2 it is provided by the third-party future package, and contains safe aliases for using python3 idioms in python2 with the python3 api.
    • The package itself is named "future", so to install the builtins module on python2 you run: pip install future
  • from builtins import str has the following effect :
    • the str constructor now gives what you think it does, i.e. text data in the form of python2 unicode objects. So it's functionally the same thing as str = unicode
    • Note : Python3 str is functionally the same as Python2 unicode
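    • Note: To get bytes, you can use the b prefix, e.g. b'\xc3\xa9' for the utf-8 bytes of é. In python2 b'é' also works, but python3 only allows ascii characters in bytes literals.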
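
Put together, the two imports applied to the line from the question might look like the sketch below. The values assigned to nombre, email and telefono are hypothetical placeholders; in the original code they come from somewhere the post does not show.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from builtins import str  # stdlib on Python 3; needs `pip install future` on Python 2

# Hypothetical sample values, not from the original post
nombre = "José"
email = "jose@example.com"
telefono = "555-0100"

# With unicode_literals the template is text, so no implicit ascii decode happens
mensaje_texto_inmobiliaria = "%s, con el email %s y el teléfono %s está se ha contactado con Inmobiliar" % (nombre, email, telefono)

# Encode only when the message leaves the program (file, socket, email body)
mensaje_bytes = mensaje_texto_inmobiliaria.encode("utf-8")
```

If any of the three values arrives as a byte string on Python 2, it would still need an explicit .decode("utf-8") before formatting.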

The takeaway is this :

  1. Decode early (on read) and encode late (on write)
  2. In python2 terms, use str objects for bytes and unicode objects for text
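
A minimal sketch of that pattern, assuming the input arrives as utf-8 bytes. The function build_message and its argument are hypothetical names used only for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def build_message(raw_name):
    """Take utf-8 bytes from the outside world and return utf-8 bytes."""
    name = raw_name.decode("utf-8")    # decode early: bytes -> text
    message = "Teléfono de %s" % name  # work only with text in the middle
    return message.encode("utf-8")     # encode late: text -> bytes


out = build_message(b"Jos\xc3\xa9")
```

Keeping every intermediate value as text means Python never has to guess an encoding, so the implicit ascii codec never gets involved.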