Brandon Kuczenski - 5 months ago
Python Question

python: Downloading and caching XML files - how to handle encoding declaration?

from urllib.request import urlopen
from lxml import objectify


I am trying to write a program that will download XML files into a cache and then open them using objectify. If I download the files using urlopen() then I can read them in using objectify.fromstring() just fine:

r = urlopen(my_url)
o = objectify.fromstring(r.read())


However, if I download them and write them to a file, I end up with an encoding declaration at the top of the file that objectify doesn't like. To wit:

# download the file
my_file = 'foo.xml'
r = urlopen(my_url)

# save locally
with open(my_file, 'wb') as fp:
    fp.write(r.read())

# open saved copy
with open(my_file, 'r') as fp:
    o1 = objectify.fromstring(fp.read())


results in
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.


If I use objectify.parse(fp) then that works fine, so I could go through and change all the client code to use parse() instead, but I feel like that is not the right approach. I have other XML files stored locally for which .fromstring() works just fine; based on a cursory review they appear to have utf-8 encoding.

I just don't know what the right resolution is here. Should I change the encoding when I save the file? Should I strip the encoding declaration? Should I fill my code with try.. except ValueError clauses? Please advise.

Answer

The file needs to be opened in binary mode rather than text mode.

open(my_file, 'rb') # b stands for binary

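as suggested by the exception: ... Please use bytes input ... In binary mode fp.read() returns bytes instead of str, so the parser can read the encoding from the declaration itself.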
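
As a rough sketch of the whole round trip with the binary-mode fix applied: the XML content, the temporary cache directory, and the load_cached helper below are made up for illustration, and xml_bytes stands in for whatever urlopen(my_url).read() would return.

```python
import os
import tempfile

from lxml import objectify

# Stand-in for the bytes returned by urlopen(my_url).read().
# The document content is invented for this example.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<root><item>42</item></root>'
)


def load_cached(path):
    """Read a cached XML file as bytes and parse it with objectify."""
    # Binary mode: fp.read() returns bytes, which fromstring() accepts
    # even when the data starts with an encoding declaration.
    with open(path, 'rb') as fp:
        return objectify.fromstring(fp.read())


# Save the download unchanged (bytes in, bytes out).
cache_dir = tempfile.mkdtemp()
my_file = os.path.join(cache_dir, 'foo.xml')
with open(my_file, 'wb') as fp:
    fp.write(xml_bytes)

# Parse the cached copy; client code can keep calling fromstring().
o1 = load_cached(my_file)
print(o1.tag, o1.item.text)
```

With this approach the file on disk stays byte-for-byte identical to what was downloaded, so there should be no need to strip the declaration, re-encode the file, or wrap calls in try/except ValueError.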
