JakSa - 3 months ago
Python Question

Python 3.x about encoding

# -*- coding: utf-8 -*-

import re
import urllib.request as request

url = "http://jjo.kr/users/38281748"

raw_data = request.urlopen(url).read()  # bytes
decoded = raw_data.decode("utf-8")


I was trying to fetch the HTML of that URL, but I got an error message:

UnicodeEncodeError: 'cp949' codec can't encode character '\ufeff' in position 2313: illegal multibyte sequence

Am I misunderstanding the function?

According to the Python 3.5.2 Standard Library documentation, decode will "Return a string decoded from the given bytes."

But I seem to get cp949 instead of a UTF-8 string.

Can anyone tell me what's wrong with my code?


The decoding worked: you did get a Unicode string (str) from the bytes.

The error happens later, when you try to print it. Python then has to encode the string for the console, and it uses cp949 because that is your stdout encoding (sys.stdout.encoding).
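To see the two steps separately, here is a minimal sketch. The byte string is a made-up stand-in for the downloaded page (it is not the real response), and ascii() is used so that the print itself cannot fail on a non-UTF-8 console.

```python
import sys

# Hypothetical stand-in for the downloaded page: a UTF-8 encoded
# U+FEFF followed by plain ASCII text.
raw = b"\xef\xbb\xbfhello"

# Decoding succeeds and yields an ordinary str that still contains U+FEFF.
decoded = raw.decode("utf-8")
print(type(decoded).__name__)
print(ascii(decoded))

# The console encoding is a separate setting, only used when printing.
print(sys.stdout.encoding)
```

The str itself has no encoding attached; cp949 only comes into play when print() writes to the console.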

The page contains \ufeff (ZERO WIDTH NO-BREAK SPACE, the character also used as the byte order mark), which cannot be represented in cp949.

>>> import unicodedata
>>> unicodedata.name('\ufeff')
'ZERO WIDTH NO-BREAK SPACE'
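You should be able to reproduce the error without the console involved by encoding to cp949 directly. This is only a sketch with a made-up sample string; the variable names are illustrative.

```python
# Made-up sample text with U+FEFF in the middle (index 3).
text = "abc\ufeffdef"

error = None
try:
    # Strict encoding (the default) raises on unencodable characters.
    text.encode("cp949")
except UnicodeEncodeError as exc:
    error = exc
    # Report where the problem is without printing the character itself.
    print(exc.start, ascii(exc.object[exc.start]))
```

This is the same exception print() raises internally when stdout uses cp949.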

You can drop or replace such characters by encoding with the 'ignore' or 'replace' error handler:

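import sys

# raw_data is the bytes object read from the URL in the question's code.
decoded = raw_data.decode("utf-8")
# Round-trip through the console encoding, dropping characters it cannot represent.
decoded = decoded.encode(sys.stdout.encoding, 'ignore').decode(sys.stdout.encoding)
print(decoded)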
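If you need this in more than one place, the same round trip could be wrapped in a small helper. This is a sketch, not a library function: to_printable and its parameters are names made up for illustration, and the fallback to UTF-8 when sys.stdout.encoding is None is an assumption.

```python
import sys

def to_printable(text, encoding=None, errors="replace"):
    """Round-trip text through an encoding so it can be printed safely."""
    # Assumption: fall back to UTF-8 if stdout reports no encoding.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors).decode(encoding)

# Made-up sample with U+FEFF in the middle.
sample = "caf\ufeffe"

print(to_printable(sample, "cp949", "ignore"))            # drops the character
print(to_printable(sample, "cp949", "replace"))           # substitutes '?'
print(to_printable(sample, "cp949", "backslashreplace"))  # keeps an escape sequence
```

'ignore' silently loses data, so 'replace' or 'backslashreplace' may be preferable when you want to notice that something was removed.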