I am fetching a web page and parsing it with the Beautiful Soup library:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.text)
Do not use r.text; leave the decoding to BeautifulSoup:

soup = BeautifulSoup(r.content)
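To see the difference without hitting a server, here is a minimal sketch; the hard-coded bytes below stand in for the r.content of a UTF-8 page served without a charset header:

```python
from bs4 import BeautifulSoup

# UTF-8 bytes for an HTML page containing "\xae"; stands in for r.content
body = b'<html><head><meta charset="utf-8"></head><body><p>\xc2\xae</p></body></html>'

# What r.text does without a charset header: decode as Latin-1 first
mojibake = BeautifulSoup(body.decode('latin-1'), 'html.parser')
print(mojibake.p.string)  # Mojibake: two characters instead of one

# Hand BeautifulSoup the raw bytes instead; it picks up the declared encoding
soup = BeautifulSoup(body, 'html.parser')
print(soup.p.string)
```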
What happens is that the server did not include the character set in the response headers. In that case, requests follows HTTP RFC 2616, section 3.7.1:
text/* responses are by default expected to use the ISO-8859-1 (Latin-1) character set.
For your HTML page, that default is wrong, and you got incorrect results;
r.text decoded the bytes as Latin-1, resulting in Mojibake:

>>> print u'®'.encode('utf8').decode('latin1')
Â®
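That Latin-1 fallback is visible in requests itself: requests.utils.get_encoding_from_headers() is the helper that inspects the Content-Type header. A quick demonstration:

```python
from requests.utils import get_encoding_from_headers

# No charset parameter: requests falls back to the RFC 2616 default for text/*
print(get_encoding_from_headers({'content-type': 'text/html'}))

# With an explicit charset, that encoding is used instead
print(get_encoding_from_headers({'content-type': 'text/html; charset=utf-8'}))
```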
HTML pages can declare the correct encoding themselves, in the form of a <meta> tag in the document's <head>. BeautifulSoup will use that declaration and decode the bytes for you.
Even if the <meta> tag is missing, BeautifulSoup has other methods to auto-detect the encoding.