NARAYAN CHANGDER - 1 month ago
Python Question

Error when encoding UTF-8

I am trying to fetch text data from a website, but this code raises an error. Please let me know where the error is.

import requests
from bs4 import BeautifulSoup

def getportions(soup):
    for p in soup.find_all("p", {"class": ""}):
        yield p.text

def readpage(address):
    page = requests.get(address)
    soup = BeautifulSoup(page.text, "html.parser")
    output_text = ''
    for s in getportions(soup):
        output_text += s.encode("utf8")
        output_text += "\n"
    print (output_text)
    print ("End of article")
    fp = open("content.txt", "w")
    fp.write(output_text)

if __name__ == "__main__":
    readpage("http://yahoo.com")


The error is shown below:


output_text += s.encode("utf8")
TypeError: Can't convert 'bytes' object to str implicitly

Answer

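In Python 3, all strings are natively Unicode, so s.encode("utf8") returns a bytes object, and bytes cannot be appended to the str output_text; that is what raises the TypeError. Drop the encode call and specify the encoding when opening the file instead. Your code could become: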

def readpage(address):
    ...
    output_text = ''
    for s in getportions(soup):
        output_text += s    # s is already a str; no encode() needed
        output_text += "\n"
    print (output_text)
    print ("End of article")
    # open() encodes the text as UTF-8 when writing
    fp = open("content.txt", "w", encoding='utf8')
    fp.write(output_text)
    fp.close()
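
To see the str/bytes distinction without fetching a page, here is a minimal sketch. The sample strings, the temporary file path, and the helper name join_portions are assumptions made for illustration; they are not part of the original code.

```python
import os
import tempfile


def join_portions(portions):
    """Join text fragments with newlines, keeping everything as str."""
    return "".join(p + "\n" for p in portions)


# In Python 3, str.encode() returns bytes, and str + bytes is not allowed.
try:
    combined = "" + "café".encode("utf8")
    mixing_failed = False
except TypeError:
    mixing_failed = True

# Keep the text as str and let open() do the encoding when writing.
text = join_portions(["café", "naïve"])
path = os.path.join(tempfile.mkdtemp(), "content.txt")
with open(path, "w", encoding="utf8") as fp:
    fp.write(text)

# Reading the raw bytes back shows that the file on disk is UTF-8.
with open(path, "rb") as fp:
    raw = fp.read()
```

The with statement closes the file automatically, which is usually preferable to calling close() by hand.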

If you simply want to sanitize the text by replacing every non-ASCII character with a ?, open the file this way instead:

   fp = open("content.txt", "w", encoding='ascii', errors='replace')
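
As a rough sketch of what errors='replace' does on write, the following uses a made-up sample string and a temporary file (both assumptions for illustration):

```python
import os
import tempfile

# Sample text with two non-ASCII characters (e-acute and i-diaeresis).
text = "caf\u00e9, na\u00efve\n"
path = os.path.join(tempfile.mkdtemp(), "content.txt")

# Each character that ASCII cannot represent is written as "?".
with open(path, "w", encoding="ascii", errors="replace") as fp:
    fp.write(text)

with open(path, "r", encoding="ascii") as fp:
    sanitized = fp.read()

# The same replacement is available in memory through str.encode().
in_memory = text.encode("ascii", errors="replace").decode("ascii")
```

Note that the replacement is lossy: once the file is written, the original characters cannot be recovered from it.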