Matt Matt - 10 months ago 41
HTML Question

What is the proper method for reading and writing HTML/XML (byte string) with Python and lxml and etree?

EDIT: Now that the problem is solved, I realize that it had more to do with properly reading/writing byte-strings, rather than HTML. Hopefully, that will make it easier for someone else to find this answer.

I have an HTML file that's poorly formatted. I want to use a Python lib to just make it tidy.

It seems like it should be as simple as the following:

import sys
from lxml import etree, html

#read the unformatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/View_Custom_Report.html', 'r', encoding='utf-8') as file:
    #read the whole file into one string
    file_text = ''.join(file.readlines())

#format the HTML
document_root = html.fromstring(file_text)
document = etree.tostring(document_root, pretty_print=True)

#write the nice, pretty, formatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/Pretty.html', 'w') as file:
    #write the pretty XML to a file
    file.write(document)

But this chunk of code complains that its input is not a string or bytes-like object. Okay, it makes sense that the function can't take a list, I suppose.

But then the result is 'bytes', not a string. No problem, I thought, and wrapped it in str() before writing.

But then I get HTML that's full of '\n' sequences that are not newlines: each one is a literal backslash followed by the letter n. And there are no actual line breaks in the result, it's just one long line.

I've tried a number of other weird things like specifying the encoding, trying to decode, etc. None of which produce the desired result.

What's the right way to read and write this kind of (is non-ASCII the right term?) text?

Answer Source

What you are missing is that etree's tostring method returns bytes, and you need to take that into account when writing the result to a file. Open the file in binary mode by adding the b flag to the mode argument of open, like this, and forget about the str() conversion:

with open('Pretty.html', 'wb') as file:
    #write the pretty XML to a file
    file.write(document)
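
To see why str() produced the literal backslash-n sequences, here is a minimal sketch that uses only the standard library. The bytes literal is a made-up stand-in for what tostring returns, and the file names and temporary directory are hypothetical, so treat it as an illustration rather than the exact fix for the file in the question.

```python
import os
import tempfile

# Hypothetical stand-in for the bytes that etree.tostring(...) returns:
# UTF-8 encoded markup containing real newline bytes.
document = b"<html>\n  <body>\n    <p>caf\xc3\xa9</p>\n  </body>\n</html>\n"

# str() on a bytes object gives its repr, so every newline byte turns
# into the two characters backslash and n, wrapped in b'...'.
as_repr = str(document)

# decode() turns the bytes into text and keeps the real newlines.
as_text = document.decode("utf-8")

# Hypothetical output locations in a throwaway directory.
out_dir = tempfile.mkdtemp()
binary_path = os.path.join(out_dir, "pretty_binary.html")
text_path = os.path.join(out_dir, "pretty_text.html")

# Option 1: open in binary mode and write the bytes untouched.
with open(binary_path, "wb") as f:
    f.write(document)

# Option 2: open in text mode and write the decoded string.
# newline="\n" stops Python from translating line endings on Windows.
with open(text_path, "w", encoding="utf-8", newline="\n") as f:
    f.write(as_text)
```

Both options put the same bytes on disk. If a str is preferred from the start, lxml's tostring also accepts encoding='unicode', in which case it returns text that can be written to a file opened in 'w' mode.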