Over the past few days I have been trying to write a script that would 1) extract the XML from a Word document, 2) modify that XML, and 3) use the new XML to create and save a new Word document. With the help of many Stack Overflow users I eventually found code that looks very promising. Here it is:
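    import os
    import tempfile
    import zipfile

    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString = zip.read("word/document.xml").decode("utf-8")  # step 1: extract the XML

    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)  # unpack every part of the original document
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        f.write(xmlString)  # step 3: overwrite with the (modified) XML
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)  # re-zip under the original names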
The problem is that you are accidentally changing the encoding of word/document.xml. The copy inside template.docx is initially encoded as UTF-8 (the default encoding for XML documents), and that is how the script decodes it:
    xmlString = zip.read("word/document.xml").decode("utf-8")
However, when you copy it for template2.docx you are changing the encoding to CP-1252. According to the documentation for open():
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
You indicated that calling locale.getpreferredencoding(False) gives you cp1252, which is therefore the encoding in which word/document.xml is being written.
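To see which encoding open() would pick on a particular machine, a small check along these lines may help. It is only a sketch: the helper name describe_default_encoding is made up for illustration, and the result depends on the platform and on whether Python's UTF-8 mode is enabled.

```python
import locale
import os
import tempfile


def describe_default_encoding():
    """Return (locale encoding, encoding open() chose for a text file)."""
    preferred = locale.getpreferredencoding(False)
    # Create a throwaway file so we can inspect what open() picks by default.
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        with open(path, "w") as f:  # no encoding argument, as in the question
            used = f.encoding
    finally:
        os.remove(path)
    return preferred, used


if __name__ == "__main__":
    preferred, used = describe_default_encoding()
    print("locale.getpreferredencoding(False):", preferred)
    print("open(path, 'w') without encoding uses:", used)
```

On a Windows machine with a Western European locale both values would typically be cp1252, which matches what you reported.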
Since you did not explicitly add <?xml version="1.0" encoding="cp1252"?> to the beginning of word/document.xml, Word (or any other XML reader) will read it as UTF-8 instead of CP-1252, which is what gives you the illegal XML character error.
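The mismatch can be reproduced without Word. The sketch below uses a made-up one-element document (the sample text and the helper parses_as_utf8 are illustrative, not taken from your files) and Python's xml.etree.ElementTree as a stand-in for "any other XML reader": the same text parses when encoded as UTF-8 and is rejected when encoded as CP-1252 with no encoding declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample containing one non-ASCII character.
xml_text = "<note>café</note>"

utf8_bytes = xml_text.encode("utf-8")      # 'é' becomes the two bytes C3 A9
cp1252_bytes = xml_text.encode("cp1252")   # 'é' becomes the single byte E9


def parses_as_utf8(data):
    """Return True if an XML parser accepts data with no encoding declaration."""
    try:
        ET.fromstring(data)  # with no declaration the parser assumes UTF-8
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("UTF-8 bytes parse:", parses_as_utf8(utf8_bytes))
    print("CP-1252 bytes parse:", parses_as_utf8(cp1252_bytes))
```

The lone E9 byte is not a valid UTF-8 sequence, so the parser gives up at that point. Word's "illegal XML character" message is presumably its way of reporting the same condition.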
So you want to specify the encoding as UTF-8 when writing, by using the encoding argument to open():
    with open(os.path.join(tmpDir, "word/document.xml"), "w", encoding="UTF-8") as f:
        f.write(xmlString)
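If you would rather not depend on text-mode defaults at all, one alternative (a different technique from the temp-directory approach in your script, offered only as a sketch) is to copy the archive entry by entry and write word/document.xml as explicitly encoded bytes. The function name replace_document_xml and its parameters are hypothetical.

```python
import zipfile


def replace_document_xml(original_docx, xml_string, new_docx):
    """Copy original_docx to new_docx, swapping in xml_string as word/document.xml."""
    with zipfile.ZipFile(original_docx) as src, \
            zipfile.ZipFile(new_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            if name == "word/document.xml":
                # Encode explicitly so the locale never gets a say.
                data = xml_string.encode("utf-8")
            else:
                # Every other part is copied through unchanged, as raw bytes.
                data = src.read(name)
            dst.writestr(name, data)
```

Called as replace_document_xml("template.docx", xmlString, "template2.docx"), this never opens a file in text mode, so locale.getpreferredencoding() does not come into play.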