Mitchell Peterson - 1 month ago
Python Question

Beautiful Soup with html.parser has trouble decoding quotation marks

I have a simple program that grabs the text of an article from Fox News, but for some reason I am having trouble getting the quotation marks decoded correctly.

from bs4 import BeautifulSoup
import urllib

r = urllib.urlopen('http://www.foxnews.com/politics/2016/10/14/emails-reveal-clinton-teams-early-plan-for-handling-bill-sex-scandals.html').read()
soup = BeautifulSoup(r, 'html.parser')

for item in soup.find_all('div', class_='article-text'):
    print item.get_text().encode('UTF-8')


This grabs the text I am looking for, but almost every quotation mark in the article is printed like this: Bill Clinton’s. I have tried explicitly decoding the page as UTF-8, and I have checked that the page declares UTF-8 as its encoding, so I am not sure why this is happening.
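
The garbled characters above match what a curly apostrophe (U+2019) looks like when its UTF-8 bytes are displayed as Windows-1252. The sketch below reproduces that pattern. It is written for Python 3, and the helper name `show_as_cp1252` is made up for illustration. It only shows one plausible mechanism, on the assumption that the console or viewer showing the output uses Windows-1252; the question does not confirm this.

```python
# Sketch (Python 3): reproduce the garbled apostrophe from correct UTF-8 bytes.
# Assumption: whatever displays the output interprets bytes as Windows-1252.

RIGHT_SINGLE_QUOTE = u"\u2019"  # curly apostrophe, as used in the article text


def show_as_cp1252(text):
    """Encode text as UTF-8, then misread those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


garbled = show_as_cp1252(u"Bill Clinton" + RIGHT_SINGLE_QUOTE + u"s")
print(garbled)
```

If this is the cause, the bytes produced by `.encode('UTF-8')` are correct and only their display is wrong.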

Answer

This does not explain why Beautiful Soup had trouble decoding the text, but I have found two roundabout ways to work around the issue. One is to declare an encoding at the top of the script:

      # This Python file uses the following encoding: utf-8

The other is to decode the text and then re-encode it as ASCII with the 'ignore' error handler, which drops every non-ASCII character:

print(temp.decode('unicode_escape').encode('ascii','ignore'))
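
To make the effect of the second workaround concrete, here is a small sketch in Python 3 syntax. It assumes `temp` holds the article text as UTF-8 bytes, which the answer does not state. It also shows a gentler alternative that maps curly quotes to straight ones rather than deleting them; the names `ASCII_QUOTES` and `to_ascii_quotes` are hypothetical.

```python
# Sketch (Python 3 syntax): what the decode/encode workaround does to a curly apostrophe.
# Assumption: `temp` stands in for the UTF-8 byte string holding the article text.
temp = u"Bill Clinton\u2019s".encode("utf-8")

# unicode_escape turns each non-ASCII byte into a Latin-1 character;
# encoding to ASCII with 'ignore' then silently drops all of them.
stripped = temp.decode("unicode_escape").encode("ascii", "ignore")

# Alternative: decode the bytes as UTF-8 and map curly quotes to plain ASCII quotes.
ASCII_QUOTES = {0x2018: u"'", 0x2019: u"'", 0x201C: u'"', 0x201D: u'"'}


def to_ascii_quotes(raw_bytes):
    """Decode UTF-8 bytes and replace curly quotes with straight ones."""
    return raw_bytes.decode("utf-8").translate(ASCII_QUOTES)


replaced = to_ascii_quotes(temp)
```

With the original workaround the apostrophe disappears entirely, so "Clinton's" becomes "Clintons"; the translation table keeps a readable apostrophe in its place.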