user5356756 - 7 months ago
Python Question

BeautifulSoup Chinese character encoding error

I'm trying to identify and save all of the headlines on a specific site, and keep getting what I believe to be encoding errors.

The site is: http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm

The current code is:

import urllib

from bs4 import BeautifulSoup

holder = {}

url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()

soup = BeautifulSoup(url, 'lxml')

head1 = soup.find_all(['h1','h2','h3'])

print head1

holder["key"] = head1


The output of the print is:

[<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>]


I'm reasonably certain that those are Unicode escape sequences, but I haven't been able to figure out how to get Python to display them as the actual characters.

I have tried to find the answer elsewhere. The question most clearly on point was this one:
Python and BeautifulSoup encoding issues

which suggested adding

soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8','ignore'))


However, that gave me the same error that is mentioned in a comment there ("AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'"). Removing the second '.BeautifulSoup' resulted in a different error ("RuntimeError: maximum recursion depth exceeded while calling a Python object").

I also tried the answer suggested here:
Chinese character encoding error with BeautifulSoup in Python?

by breaking up the creation of the object:

html = urllib2.urlopen("http://www.515fa.com/che_1978.html")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)


but that also generated the recursion error. Any other tips would be most appreciated.

Thanks.

Answer

If the text contains literal \uXXXX escape sequences, decode it using unicode-escape (Python 2):

In [6]: from bs4 import BeautifulSoup

In [7]: h = """<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>"""

In [8]: soup = BeautifulSoup(h, 'lxml')

In [9]: print(soup.h3.text.decode("unicode-escape"))
环境污染最小化 资源利用最大化
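
As a rough illustration of what that decode step does, here is a small sketch that uses only the standard library and leaves BeautifulSoup out. The variable names `escaped` and `decoded` are made up for the example, and the input is assumed to be a byte string holding literal backslash-u sequences like the ones in the printed output above:

```python
from __future__ import print_function

import codecs

# Literal backslash-u sequences, e.g. text copied from a printed repr.
# (Assumed input for this sketch; it is a byte string, not decoded text.)
escaped = b"\\u73af\\u5883\\u6c61\\u67d3\\u6700\\u5c0f\\u5316"

# The "unicode_escape" codec turns each escape sequence into its character.
decoded = codecs.decode(escaped, "unicode_escape")

# Printing the decoded text shows the characters on a UTF-8 terminal.
print(decoded)
```

`codecs.decode` is used here instead of the `.decode()` method so the same lines should run on both Python 2 and Python 3.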

If you look at the page source, you can see the data is UTF-8 encoded:

<meta http-equiv="content-language" content="utf-8" />

For me, using bs4 4.4.1, simply decoding what urllib returns also works fine:

In [1]: from bs4 import BeautifulSoup

In [2]: import urllib

In [3]: url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()

In [4]: soup = BeautifulSoup(url.decode("utf-8"), 'lxml')

In [5]: print(soup.h3.text)
环境污染最小化 资源利用最大化
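
It may also help to see why the original `print head1` showed escapes at all. Printing a list displays the repr of each element, and under Python 2 that repr escapes non-ASCII characters, so the data itself is probably fine. The sketch below uses a hard-coded byte string as a stand-in for what `urlopen(...).read()` returns (assumed to be UTF-8) rather than fetching the page; the names `raw` and `text` are illustrative only:

```python
from __future__ import print_function

# Stand-in for the raw bytes returned by urlopen(...).read() (assumed UTF-8).
raw = u"<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316</h3>".encode("utf-8")

# Decode the bytes to text before parsing or printing.
text = raw.decode("utf-8")

# A list is displayed using repr() of each element. Under Python 2 that
# escapes non-ASCII characters, which is where the \uXXXX sequences come from.
print([text])

# Printing the string itself writes the actual characters (UTF-8 terminal).
print(text)
```

So to display or save the headlines, loop over the results and print or store each tag's `.text` individually rather than printing the whole list.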