JakSa - 1 month ago
Python Question

Python 3.x about encoding

# -*- coding: utf-8 -*-

import urllib.request as request
import re

url = "http://jjo.kr/users/38281748"

raw_data = request.urlopen(url).read()  # bytes
decoded = raw_data.decode("utf-8")
print(decoded)

I was trying to fetch the HTML of that URL, but I got this error message:


UnicodeEncodeError: 'cp949' codec can't encode character '\ufeff' in position 2313: illegal multibyte sequence


Am I misunderstanding the function decode()?

According to the Python 3.5.2 Standard Library documentation, decode() should "Return a string decoded from the given bytes."

But the error mentions cp949, not utf-8.

Can anyone tell me what's wrong with my code?

Answer

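decode() worked: decoding the bytes gave you a Unicode string (str).

The error is raised when you print it. To write the string to the console, Python has to encode it again, and it uses cp949 because that is your stdout encoding (sys.stdout.encoding). That is why the exception is a UnicodeEncodeError rather than a UnicodeDecodeError.

The page contains \ufeff (ZERO WIDTH NO-BREAK SPACE, also used as the byte order mark), which cannot be represented in cp949.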

>>> import unicodedata
>>> unicodedata.name('\ufeff')
'ZERO WIDTH NO-BREAK SPACE'
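
The failure can be reproduced without any network access. The sketch below simulates a cp949 console with an in-memory stream; the helper name print_to_console and the sample string are made up for illustration and are not part of the original code.

```python
import io


def print_to_console(text, encoding):
    """Print text to an in-memory stream that encodes like a console using `encoding`.

    Returns the bytes that would have been sent to the console.
    """
    fake_console = io.TextIOWrapper(io.BytesIO(), encoding=encoding, newline="\n")
    print(text, file=fake_console)
    fake_console.flush()
    return fake_console.buffer.getvalue()


# Hypothetical stand-in for the downloaded page: UTF-8 bytes containing U+FEFF.
raw_data = "abc\ufeffdef".encode("utf-8")
decoded = raw_data.decode("utf-8")  # decoding succeeds

try:
    print_to_console(decoded, "cp949")
except UnicodeEncodeError as exc:
    print("cp949 console failed:", exc.reason)

# A UTF-8 console can represent the character, so printing succeeds.
print(print_to_console(decoded, "utf-8"))
```

Decoding is never the step that fails here; only the implicit encode inside print() does.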

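You can drop or replace such characters by encoding with the 'ignore' or 'replace' error handler:

import sys

# raw_data is the bytes object returned by request.urlopen(url).read()
decoded = raw_data.decode("utf-8")
# Round-trip through the console encoding, dropping characters it cannot represent
decoded = decoded.encode(sys.stdout.encoding, 'ignore').decode(sys.stdout.encoding)
print(decoded)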
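
To compare the two error handlers, the round trip can be wrapped in a small helper. This is only a sketch: the name console_safe, its defaults, and the sample string are assumptions for illustration, not something from the original answer.

```python
import sys


def console_safe(text, encoding=None, errors="replace"):
    """Return text limited to characters that `encoding` can represent.

    Characters that cannot be encoded are handled according to `errors`
    ('ignore' drops them, 'replace' substitutes '?').
    """
    # Fall back to UTF-8 if stdout has no encoding (for example, when it was replaced).
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors).decode(encoding)


sample = "a\ufeffb"  # hypothetical sample containing U+FEFF
print(console_safe(sample, "cp949", "ignore"))   # the character is dropped
print(console_safe(sample, "cp949", "replace"))  # the character becomes "?"
```

'ignore' keeps the output clean but hides the fact that something was removed, while 'replace' leaves a visible marker; which one fits depends on whether you need to notice lost characters.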