pgonzaleznetwork pgonzaleznetwork - 5 months ago 20
Python Question

Python: unexpected behavior when I don't decode to utf-8

I have the following function

import urllib.request

def seek():
    web = urllib.request.urlopen("http://wecloudforyou.com/")
    text = web.read().decode("utf8")
    return text

texto = seek()
print(texto)


When I decode to utf-8, I get the html code with indentation and carriage returns and all, just like it's seen on the actual website.

<!DOCTYPE html>
<html>
<head>
<title>We Cloud for You |


If I remove .decode('utf8'), I get the code, but the indentation is gone and it's replaced by \n.

<!DOCTYPE html>\n<html>\n <head>\n <title>We Cloud for You


So, why is this happening? As far as I know, when you decode, you are basically converting some encoded string into Unicode.

My sys.stdout.encoding is CP1252 (Windows 1252 encoding)

According to this thread: Why does Python print unicode characters when the default encoding is ASCII?


- Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- the terminal displays output according to its own encoding settings.
- the terminal's encoding is independent from the shell's.


So, it seems like Python needs to read the text as Unicode before it can convert it to CP1252 and print it on the terminal. But I don't understand why, if the text is not decoded, it replaces the indentation with \n.

sys.getdefaultencoding() returns utf8.
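To make the two steps in that reasoning concrete, here is a minimal sketch. It assumes the page is served as UTF-8 and uses a short made-up sample string instead of the real response; the variable names are illustrative only.

```python
# Sample bytes standing in for what web.read() returns (assumed UTF-8, not the real page).
raw = "caf\u00e9\n".encode("utf-8")

# Step 1: decoding turns the bytes into a str (Unicode text).
text = raw.decode("utf-8")

# Step 2: when a str is written to a stream, it is encoded with that stream's
# encoding. cp1252 is used here because that is the console encoding in the question.
as_cp1252 = text.encode("cp1252")

print(raw)        # b'caf\xc3\xa9\n'
print(as_cp1252)  # b'caf\xe9\n'
```

The same character ends up as different bytes depending on the target encoding, which is why the text has to exist as a str before it can be re-encoded for the console.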

Answer

In Python 3, when you pass a bytes value (raw bytes from the network, without decoding) to print(), you see the representation of that value as a Python bytes literal. This includes showing newlines as the escape sequence \n rather than as actual line breaks.

By decoding, you now have a unicode string value instead, and print() can handle that directly:

>>> print(b'Newline\nAnother line')
b'Newline\nAnother line'
>>> print(b'Newline\nAnother line'.decode('utf8'))
Newline
Another line

This is perfectly normal behaviour.
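As a rough illustration of why this happens: print() converts its argument with str(), and for a bytes object str() gives the same result as repr(). The sketch below uses a short sample value in place of the real page, and the variable names are made up for the example.

```python
# Sample bytes standing in for the undecoded response body.
data = b"<!DOCTYPE html>\n<html>\n"

# For bytes, str() produces the literal form, with the newline written as the
# two characters backslash and n.
shown_for_bytes = str(data)

# After decoding, the value is a str and its newlines are real line breaks.
text = data.decode("utf-8")

print(shown_for_bytes)  # one line: b'<!DOCTYPE html>\n<html>\n'
print(text)             # two lines of markup
```

So the newlines are never lost. In the bytes case you are looking at a printable description of the data, not the data itself.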
