K B - 3 months ago
C# Question

Simplify/clean up the XML of a DOCX Word document

I have a Microsoft Word Document (docx) and I use Open XML SDK 2.0 Productivity Tool to generate C# code from it.

I want to programmatically insert some database values into the document.
To do this, I typed plain-text markers such as [[place holder 1]] at the points where my program should substitute the database values.

Unfortunately, the XML output is rather messy. For example, I have a table with two neighboring cells that should be identical apart from their placeholders, yet one of the placeholders is split into several runs.

[[good place holder]]

<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:tcPr>
    <w:tcW w:w="1798" w:type="dxa" />
    <w:shd w:val="clear" w:color="auto" w:fill="auto" />
  </w:tcPr>
  <w:p w:rsidRPr="008C2E16" w:rsidR="001F54BF" w:rsidP="000D7B67" w:rsidRDefault="0009453E">
    <w:pPr>
      <w:spacing w:after="0" w:line="240" w:lineRule="auto" />
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
    </w:pPr>
    <w:r w:rsidRPr="0009453E">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t>[[good place holder]]</w:t>
    </w:r>
  </w:p>
</w:tc>


versus [[bad place holder]]

<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:tcPr>
    <w:tcW w:w="1799" w:type="dxa" />
    <w:shd w:val="clear" w:color="auto" w:fill="auto" />
  </w:tcPr>
  <w:p w:rsidRPr="008C2E16" w:rsidR="001F54BF" w:rsidP="000D7B67" w:rsidRDefault="00EA211A">
    <w:pPr>
      <w:spacing w:after="0" w:line="240" w:lineRule="auto" />
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
    </w:pPr>
    <w:r w:rsidRPr="00EA211A">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t>[[</w:t>
    </w:r>
    <w:proofErr w:type="spellStart" />
    <w:r w:rsidRPr="00EA211A">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t>bad</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd" />
    <w:r w:rsidRPr="00EA211A">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t xml:space="preserve"> place holder]]</w:t>
    </w:r>
  </w:p>
</w:tc>


Is there a way to have Microsoft Word clean up my document so that every placeholder is easy to identify in the generated XML?

K B
Answer

I have found a solution: the Open XML PowerTools Markup Simplifier.

I followed the steps described at http://ericwhite.com/blog/2011/03/09/getting-started-with-open-xml-powertools-markup-simplifier/, but they didn't work exactly as written (perhaps because Power Tools is now at version 2.2?). So I compiled PowerTools 2.2 in "Release" mode and added a reference to OpenXmlPowerTools.dll in my TestMarkupSimplifier.csproj. In Program.cs I only changed the path to my DOCX file. I ran the program once, and my document now seems fairly clean.

Code quoted from Eric's blog in the link above:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using OpenXmlPowerTools;
using DocumentFormat.OpenXml.Packaging;

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", true))
        {
            SimplifyMarkupSettings settings = new SimplifyMarkupSettings
            {
                RemoveComments = true,
                RemoveContentControls = true,
                RemoveEndAndFootNotes = true,
                RemoveFieldCodes = false,
                RemoveLastRenderedPageBreak = true,
                RemovePermissions = true,
                RemoveProof = true,
                RemoveRsidInfo = true,
                RemoveSmartTags = true,
                RemoveSoftHyphens = true,
                ReplaceTabsWithSpaces = true,
            };
            MarkupSimplifier.SimplifyMarkup(doc, settings);
        }
    }
}
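
To illustrate why the cleanup helps with the [[bad place holder]] cell from the question, here is a rough sketch of the idea on raw XML, using only System.Xml.Linq. It is not how PowerTools implements SimplifyMarkup; the class RunMerger and its methods MergeAdjacentRuns and ReplacePlaceholder are hypothetical names made up for this example, and the sample paragraph is a shortened copy of the one in the question. It assumes that runs holding only w:rPr and a single w:t can be merged when their w:rPr is identical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RunMerger
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Shortened version of the "bad" paragraph from the question.
    public const string SampleParagraph = @"
<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
  <w:r w:rsidRPr='00EA211A'><w:rPr><w:sz w:val='20' /></w:rPr><w:t>[[</w:t></w:r>
  <w:proofErr w:type='spellStart' />
  <w:r w:rsidRPr='00EA211A'><w:rPr><w:sz w:val='20' /></w:rPr><w:t>bad</w:t></w:r>
  <w:proofErr w:type='spellEnd' />
  <w:r w:rsidRPr='00EA211A'><w:rPr><w:sz w:val='20' /></w:rPr>
    <w:t xml:space='preserve'> place holder]]</w:t></w:r>
</w:p>";

    // A run is "plain" here if it has no attributes left and contains
    // only an optional w:rPr plus exactly one w:t.
    static bool IsPlainTextRun(XElement r) =>
        !r.HasAttributes &&
        r.Elements().All(e => e.Name == W + "rPr" || e.Name == W + "t") &&
        r.Elements(W + "t").Count() == 1;

    static bool CanMerge(XElement a, XElement b) =>
        IsPlainTextRun(a) && IsPlainTextRun(b) &&
        XNode.DeepEquals(a.Element(W + "rPr"), b.Element(W + "rPr"));

    public static void MergeAdjacentRuns(XElement paragraph)
    {
        // Spell-check markers sit between runs and keep them apart.
        paragraph.Elements(W + "proofErr").Remove();

        // Revision-save IDs (w:rsid*) make otherwise equal runs look different.
        foreach (var run in paragraph.Elements(W + "r"))
            run.Attributes()
               .Where(a => a.Name.Namespace == W && a.Name.LocalName.StartsWith("rsid"))
               .Remove();

        XElement previous = null;
        foreach (var run in paragraph.Elements(W + "r").ToList())
        {
            // Merge only into the directly preceding sibling element.
            bool adjacent = previous != null &&
                            run.ElementsBeforeSelf().LastOrDefault() == previous;
            if (adjacent && CanMerge(previous, run))
            {
                var target = previous.Element(W + "t");
                target.Value += run.Element(W + "t").Value;
                // Keep leading/trailing spaces of the merged text.
                target.SetAttributeValue(XNamespace.Xml + "space", "preserve");
                run.Remove();
            }
            else
            {
                previous = run;
            }
        }
    }

    public static void ReplacePlaceholder(XElement paragraph, string placeholder, string value)
    {
        foreach (var t in paragraph.Descendants(W + "t").ToList())
            t.Value = t.Value.Replace(placeholder, value);
    }

    public static void Main()
    {
        var p = XElement.Parse(SampleParagraph);
        MergeAdjacentRuns(p);
        ReplacePlaceholder(p, "[[bad place holder]]", "42");
        Console.WriteLine(p);
    }
}
```

Once the placeholder sits in a single w:t element, a plain string replace is enough. For real documents I would still run the PowerTools simplifier rather than this sketch, since it covers many more cases.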