logcat - 10 months ago
Python Question

Converting Python's bytes object to a string causes data inside HTML to disappear

I’m trying to read HTML content and extract only the data (such as the lines in a Wikipedia article). Here’s my code in Python:

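import urllib.request
from html.parser import HTMLParser

urlText = []

# Define the HTML parser
class parseText(HTMLParser):
    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)  # assumed body: this line is missing from the post

def main():
    thisurl = "https://en.wikipedia.org/wiki/Python_(programming_language)"
    # Create an instance of the HTML parser (the class above)
    lParser = parseText()
    # Read the page as bytes, then decode it to a string
    with urllib.request.urlopen(thisurl) as url:
        htmlAsBytes = url.read()
    htmlAsString = htmlAsBytes.decode(encoding="utf-8")
    # Feed the HTML into the parser; handle_data is called implicitly
    lParser.feed(htmlAsString)  # assumed: the feed() call is missing from the post
    lParser.close()
    # Print the collected text (loop body assumed; the post cuts off here)
    #for item in urlText:
    #    print(item)

if __name__ == "__main__":
    main()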

I do get the HTML content from the webpage and if I print the bytes object returned by the read() method, it looks like I receive all the HTML content of the webpage. However, when I try to parse this content to get rid of the tags and store only the readable data, I’m not getting the result I expect at all.

The problem is that in order to use the feed() method of the parser, one has to convert the bytes object to a string. To do that you use the decode() method, which receives the encoding with which to do the conversion. If I print the decoded string, the content printed doesn’t contain the data itself (the useful readable data I’m trying to extract). Why does that happen and how can I solve this?
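The decode-then-feed step described above can be tried offline on a small in-memory page, which makes it easier to see what the parser actually hands to handle_data(). This is only a minimal sketch: the class name TextCollector, the helper extract_text, and the sample bytes are made up for illustration and are not part of the original code.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical parser that stores every non-whitespace text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip nodes that contain only whitespace or newlines
            self.chunks.append(text)


def extract_text(html_bytes, encoding="utf-8"):
    """Decode the bytes, feed the string to the parser, return the text nodes."""
    html_text = html_bytes.decode(encoding)  # bytes -> str, needed by feed()
    parser = TextCollector()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered in the parser
    return parser.chunks


# Assumed sample input standing in for the bytes returned by url.read()
sample = b"<html><body><h1>Title</h1><p>Hello, <b>world</b>!</p></body></html>"
print(extract_text(sample))  # ['Title', 'Hello,', 'world', '!']
```

Keeping the collected text on the parser instance instead of in a module-level list means each parse starts from an empty result.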

Note: I'm using Python 3.

Thanks for the help.

Answer

All right, I eventually used Beautiful Soup to do the job, as Alden recommended, but I still don't know why the decoding process mysteriously gets rid of the data.
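Whether decode() itself drops anything can be checked directly with a round trip: decode the bytes, encode the result again, and compare with the original. The sketch below is a sanity check under the assumption that the page really is UTF-8; the function name decoding_is_lossless and the sample text are hypothetical. It does not explain the behaviour seen in the question, it only helps rule decoding in or out.

```python
def decoding_is_lossless(raw, encoding="utf-8"):
    """Return True if decoding and re-encoding reproduces the original bytes.

    With the default strict error handling, decode() raises
    UnicodeDecodeError on invalid input instead of silently dropping it.
    """
    text = raw.decode(encoding)
    return text.encode(encoding) == raw


# Assumed sample standing in for the bytes returned by url.read()
sample = "<p>naïve café</p>".encode("utf-8")

print(decoding_is_lossless(sample))
# The string is shorter than the bytes object because each accented
# character takes two bytes in UTF-8 but counts as one character in str.
print(len(sample), len(sample.decode("utf-8")))
```

If the check passes on the real page, the missing text is more likely a matter of how the output is printed or parsed than of the decoding step.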