Mazzy - 5 months ago
Python Question

Encoding issue of a character in utf-8

I get a link from a web page using the Beautiful Soup library, via a.get('href'). The link contains a special character, ®, but when I retrieve it, it becomes Â®. How can I encode it properly? I have already added # -*- coding: utf-8 -*- at the top of the file.


r = requests.get(url)

soup = BeautifulSoup(r.text)

Answer

Do not use r.text; leave decoding to BeautifulSoup:

soup = BeautifulSoup(r.content)

r.content gives you the response as bytes, without decoding. r.text, on the other hand, is the response decoded to unicode.

What happens is that the server did not include the character set in the response headers. In that case, requests follows the HTTP RFC 2616, section 3.7.1: text/* responses are expected to use the ISO-8859-1 (Latin-1) character set by default.
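The fallback described above can be approximated with the standard library alone. This is a minimal sketch, not the code requests actually runs: the function name charset_from_content_type is hypothetical, and it only models the rule that a text/* response with no declared charset is treated as ISO-8859-1.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset a client would assume for a Content-Type value.

    Hypothetical helper: a declared charset wins; a text/* type without one
    falls back to ISO-8859-1; anything else yields None.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    declared = msg.get_param("charset")
    if declared:
        return declared
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


body = "®".encode("utf-8")  # the bytes the server actually sent

# No charset in the header: the Latin-1 fallback mangles the character.
wrong = body.decode(charset_from_content_type("text/html"))

# Charset declared: the same bytes decode correctly.
right = body.decode(charset_from_content_type("text/html; charset=utf-8"))

print(wrong)  # Â®
print(right)  # ®
```

If you do want to keep using r.text, requests also lets you assign r.encoding yourself before reading it.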

For your HTML page, that default is wrong, and you got incorrect results; r.text decoded the UTF-8 bytes as Latin-1, resulting in mojibake:

>>> print u'®'.encode('utf8').decode('latin1')
Â®
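For reference, here is the same demonstration in Python 3 syntax, together with the reverse round trip that repairs a string which has already been mis-decoded. It is a sketch that assumes the damage was exactly one UTF-8-read-as-Latin-1 step; the variable names are illustrative only.

```python
original = "®"

# UTF-8 encodes ® as two bytes (0xC2 0xAE); Latin-1 reads each byte as a
# separate character, which produces the two-character string "Â®".
mojibake = original.encode("utf-8").decode("latin-1")

# Reversing the two steps recovers the original text.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(mojibake)  # Â®
print(repaired)  # ®
```

Repairing after the fact is a workaround; decoding the raw bytes with the right codec in the first place is the cleaner fix.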

An HTML page can declare its own encoding in a <meta> tag in the document head. BeautifulSoup will use that declaration and decode the bytes for you.

Even if the <meta> header tag is missing, BeautifulSoup includes other methods to auto-detect encodings.
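To illustrate the idea of reading the encoding from a <meta> tag, here is a simplified standard-library sketch. It is not BeautifulSoup's implementation, which is considerably more thorough; MetaCharsetSniffer and decode_html are hypothetical names, and the UTF-8 fallback is an assumption of this example.

```python
from html.parser import HTMLParser


class MetaCharsetSniffer(HTMLParser):
    """Record the first charset declared by a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: content="text/html; charset=utf-8"
            for part in (attrs.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.lower() == "charset" and value:
                    self.charset = value.strip("'\" ")


def decode_html(raw, fallback="utf-8"):
    """Decode raw HTML bytes using the <meta> charset, else `fallback`."""
    sniffer = MetaCharsetSniffer()
    # Latin-1 maps every byte to a character, so this pre-scan never fails
    # and leaves the ASCII-only <meta> markup readable.
    sniffer.feed(raw.decode("latin-1"))
    try:
        return raw.decode(sniffer.charset or fallback, errors="replace")
    except LookupError:
        # The page declared a codec name Python does not know.
        return raw.decode(fallback, errors="replace")


page = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><a href="/brand®">link</a></body></html>'
).encode("utf-8")
text = decode_html(page)
print("®" in text)  # True
```

In practice, passing r.content to BeautifulSoup as shown in the answer does this work for you, and soup.original_encoding reports which encoding it settled on.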