Python Question

Converting Python's bytes object to a string causes data inside html to disappear

I’m trying to read HTML content and extract only the data (such as the lines in a Wikipedia article). Here’s my code in Python:

import urllib.request
from html.parser import HTMLParser

urlText = []


# Define HTML parser: collect every piece of text the parser encounters
class parseText(HTMLParser):
    def handle_data(self, data):
        print(data)
        if data != '\n':
            urlText.append(data)


def main():
    thisurl = "https://en.wikipedia.org/wiki/Python_(programming_language)"
    # Create an instance of the HTML parser (the class above)
    lParser = parseText()
    # Feed the HTML document into the parser; handle_data is called implicitly
    with urllib.request.urlopen(thisurl) as url:
        htmlAsBytes = url.read()
        #print(htmlAsBytes)
        htmlAsString = htmlAsBytes.decode(encoding="utf-8")
        #print(htmlAsString)
        lParser.feed(htmlAsString)
        lParser.close()
    #for item in urlText:
    #    print(item)


if __name__ == "__main__":
    main()


I do get the HTML content from the webpage: if I print the bytes object returned by read(), it looks like the full page arrived. However, when I try to parse that content to strip the tags and keep only the readable data, the result is not at all what I expect.

The problem is that, in order to use the parser's feed() method, the bytes object first has to be converted to a string. That is done with the decode() method, which takes the encoding to use for the conversion. But if I print the decoded string, the output doesn't contain the data itself (the useful readable text I'm trying to extract). Why does that happen and how can I solve it?
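
Here is the kind of sanity check I mean (a minimal sketch, separate from the script above), just to see whether the bytes-to-string conversion itself drops anything:

import urllib.request

thisurl = "https://en.wikipedia.org/wiki/Python_(programming_language)"
with urllib.request.urlopen(thisurl) as url:
    htmlAsBytes = url.read()

htmlAsString = htmlAsBytes.decode(encoding="utf-8")
# The lengths are comparable, and re-encoding should give back the original
# bytes (True if the page really is UTF-8), so nothing can vanish in this
# step alone.
print(len(htmlAsBytes), len(htmlAsString))
print(htmlAsString.encode("utf-8") == htmlAsBytes)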

Note: I'm using Python 3.

Thanks for the help.

Answer

All right, I eventually used BeautifulSoup to do the job, as Alden recommended, but I still don't know why the decoding process mysteriously gets rid of the data.
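
For reference, the BeautifulSoup version looks roughly like this (a sketch rather than my exact script; it assumes the bs4 package is installed and uses get_text() to strip the tags):

import urllib.request
from bs4 import BeautifulSoup

thisurl = "https://en.wikipedia.org/wiki/Python_(programming_language)"
with urllib.request.urlopen(thisurl) as url:
    htmlAsBytes = url.read()

soup = BeautifulSoup(htmlAsBytes, "html.parser")
# Drop script and style blocks so only the readable article text is left
for tag in soup(["script", "style"]):
    tag.decompose()

urlText = [line for line in soup.get_text().splitlines() if line.strip()]
print("\n".join(urlText))

Feeding the raw bytes straight to BeautifulSoup also sidesteps the manual decode() call, since bs4 works out the encoding on its own.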