user2333196 - 1 year ago 62
Python Question

Beautiful Soup returning strange characters (Chinese)

Using Python 3 and BeautifulSoup 4, this code:

soup = BeautifulSoup(r.content, "html.parser")

returns what you would expect.


For this URL (a similar page, but for a different game), the same code returns this:


Answer Source

I can replicate this using .content. It happens because of the following meta tag, which sets the charset to UTF-16:

<META http-equiv="Content-Type" content="text/html; charset=UTF-16">

A workaround is to pass from_encoding="utf-8" so that bs4 ignores the meta tag:

soup = BeautifulSoup(r.content, "lxml", from_encoding="utf-8")

You can also decode the bytes:

soup = BeautifulSoup(r.content.decode("utf-8"), "lxml")
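
A minimal sketch of the decode-it-yourself approach, using only the standard library. The helper name decode_html and its header_charset parameter are hypothetical (they are not part of requests or bs4), and the sketch assumes the HTTP header is more trustworthy than the meta tag, which holds for this page but may not for others:

```python
from typing import Optional


def decode_html(raw: bytes, header_charset: Optional[str] = None) -> str:
    """Decode a response body without consulting any <meta> charset tag.

    Hypothetical helper: tries the charset from the HTTP header first
    (if one is given), then strict UTF-8.
    """
    candidates = [c for c in (header_charset, "utf-8") if c]
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue
    # Last resort: never raises, but undecodable bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Made-up body: UTF-8 bytes that carry a misleading UTF-16 meta tag.
raw = (
    '<META http-equiv="Content-Type" content="text/html; charset=UTF-16">'
    "<p>café</p>"
).encode("utf-8")

text = decode_html(raw, "UTF-8")
print(text)
```

The resulting str can be handed to BeautifulSoup as-is; when bs4 receives str rather than bytes it has no encoding left to guess.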

If you print the headers, you can see 'Content-Type': 'text/html; Charset=UTF-8'. The data is actually UTF-8 encoded; it is the meta tag that is incorrect.
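
As a rough illustration of that mismatch, the header value and the meta tag quoted above can be compared with a small regex. The function name find_charset is made up for this sketch, and the regex is a simplification rather than a full HTML or HTTP parser:

```python
import re
from typing import Optional

# Matches charset=... in either an HTTP header value or a <meta> tag.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def find_charset(data: bytes) -> Optional[str]:
    """Return the first declared charset in lowercase, or None."""
    match = _CHARSET_RE.search(data)
    return match.group(1).decode("ascii").lower() if match else None


# Values copied from the answer text, not fetched from the network.
header_value = "text/html; Charset=UTF-8"
body_start = b'<META http-equiv="Content-Type" content="text/html; charset=UTF-16">'

header_charset = find_charset(header_value.encode("ascii"))
meta_charset = find_charset(body_start)

if header_charset != meta_charset:
    print(f"mismatch: header says {header_charset}, meta tag says {meta_charset}")
```

With a live response, the same check could be run on r.headers.get("Content-Type", "") and the first couple of kilobytes of r.content.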

If we fetch the content and decode it ourselves, printing a slice of each result shows that bs4 is in fact decoding it as UTF-16:

In [1]: import requests

In [2]: r = requests.get("")

In [3]: print(r.content.decode("utf-16")[:10])

In [4]: print(r.content.decode("utf-8")[:10])
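
The same effect can be reproduced offline. The sample markup below is made up, since the original URL is not shown, and "utf-16-le" is used instead of plain "utf-16" as an assumption so the result does not depend on a byte order mark or platform byte order. Each pair of ASCII bytes is read as one 16-bit code unit, which is why plain HTML ends up looking like Chinese text:

```python
# Made-up sample: 12 ASCII bytes, as the start of a UTF-8 page might look.
sample = "<html><head>".encode("utf-8")

right = sample.decode("utf-8")      # one character per byte
wrong = sample.decode("utf-16-le")  # one character per pair of bytes

print(len(right), repr(right))
print(len(wrong), [hex(ord(ch)) for ch in wrong])

# Every mis-decoded character falls in the CJK ideograph blocks
# (U+3400..U+9FFF), hence the "Chinese" output.
print(all(0x3400 <= ord(ch) <= 0x9FFF for ch in wrong))
```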
