user2333196 - 4 months ago
Python Question

Beautiful Soup returning strange characters (Chinese)

Using Python 3 and BeautifulSoup 4:

import requests
from bs4 import BeautifulSoup

url = 'http://www.eurobasket2015.org/en/compID_qMRZdYCZI6EoANOrUf9le2.season_2015.roundID_9322.gameID_9323-C-1-1.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")


This returns what you would expect.

However, for this URL (a similar page, but a different game):

http://www.eurobasket2015.org/en/compID_qMRZdYCZI6EoANOrUf9le2.season_2015.roundID_9322.gameID_9323-D-3-1.html

the same code returns this:

ℼ佄呃偙⁅瑨汭倠䉕䥌⁃ⴢ⼯㍗⽃䐯䑔䠠䵔⁌⸴㄰吠慲獮瑩潩慮⽬䔯≎∠瑨灴⼺眯睷眮⸳牯⽧剔砯瑨汭⼱呄⽄

Answer

I can reproduce this using .content. It happens because of the following meta tag, which declares the charset as UTF-16:

<META http-equiv="Content-Type" content="text/html; charset=UTF-16">

A workaround is to specify the from_encoding as utf-8:

soup = BeautifulSoup(r.content, "lxml", from_encoding="utf-8")

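You can also decode the bytes yourself before passing them in (naming the parser explicitly avoids the bs4 warning about no parser being specified):

soup = BeautifulSoup(r.content.decode("utf-8"), "html.parser")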

If you print the response headers, you can see 'Content-Type': 'text/html; Charset=UTF-8'. The data is actually UTF-8 encoded; it is the meta tag that is incorrect.
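
To make the mismatch concrete, here is a minimal sketch that compares the charset declared in the HTTP header with the one declared in the meta tag. It uses only the standard library so it runs without touching the network. The helper names header_charset and meta_charset are made up for illustration, the regex is a rough approximation and not a real HTML parser, and the two sample strings are copied from the header and meta tag quoted in this answer.

```python
import re
from email.message import Message


def header_charset(content_type):
    """Return the charset parameter of a Content-Type header value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(html_bytes):
    """Return the first charset=... found in the markup, lowercased (rough heuristic)."""
    match = re.search(rb'charset\s*=\s*["\']?([\w-]+)', html_bytes, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


# Sample values taken from the response discussed above.
header = "text/html; Charset=UTF-8"
body = b'<META http-equiv="Content-Type" content="text/html; charset=UTF-16">'

from_header = header_charset(header)
from_meta = meta_charset(body)
print(from_header, from_meta)

if from_header != from_meta:
    print("header and meta tag disagree about the encoding")
```

With a live response you would pass r.headers["Content-Type"] and r.content instead of the hard-coded samples. When the two disagree, as they do here, passing the header's charset explicitly is what the workarounds above amount to.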

If we fetch the content, decode it ourselves, and print a slice, we can see that bs4 is in fact decoding it as UTF-16:

In [1]: import requests

In [2]: r = requests.get("http://www.eurobasket2015.org/en/compID_qMRZdYCZI6EoANOrUf9le2.season_2015.roundID_9322.gameID_9323-D-3-1.html")

In [3]: print(r.content.decode("utf-16")[:10])
ℼ佄呃偙⁅瑨汭倠䉕䥌

In [4]: print(r.content.decode("utf-8")[:10])
<!DOCTYPE
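
The same effect can be reproduced offline, which may help show why the output looks Chinese. This is only a sketch: the sample bytes are a stand-in for the start of the page, and utf-16-le is spelled out on the assumption that the garbled output came from a little-endian decode (plain "utf-16" without a BOM falls back to the platform's byte order).

```python
# Stand-in for the first ten bytes of the page: UTF-8 (here pure ASCII) text.
raw = "<!DOCTYPE ".encode("utf-8")

# Decoding as UTF-16 pairs up the bytes: every two ASCII bytes are read as one
# 16-bit code unit, and many of those values fall in the CJK range.
garbled = raw.decode("utf-16-le")
print(garbled)

# Decoding with the encoding the server actually used gives readable text.
print(raw.decode("utf-8"))

# The garbling is reversible as long as no bytes were lost along the way.
recovered = garbled.encode("utf-16-le").decode("utf-8")
print(recovered == "<!DOCTYPE ")
```

Ten ASCII bytes become five characters, and they are the same five characters that open the garbled output above.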