Shekhar Shekhar - 10 days ago 5
C# Question

System.IO.FileFormatException when opening an Excel worksheet embedded in PowerPoint 2016 with the Open XML SDK

I have PPTX files generated by users with PowerPoint 2016. The slides contain embedded Excel worksheets, which I need to access for further processing. I am using Open XML SDK v2.6.1 in my project.

I pass the embedded object stream to SpreadsheetDocument.Open using the following code:

using (PresentationDocument pd = PresentationDocument.Open(pptxFile, true))
{
    foreach (SlidePart slide in pd.PresentationPart.GetPartsOfType<SlidePart>())
    {
        foreach (EmbeddedObjectPart eoPart in slide.EmbeddedObjectParts)
        {
            using (SpreadsheetDocument sd = SpreadsheetDocument.Open(eoPart.GetStream(), true))
            {
                // do some work with worksheets
                var count = sd.WorkbookPart.WorksheetParts.Count();
            }
        }
    }
}


I get the following exception:

System.IO.FileFormatException: File contains corrupted data.
   at System.IO.Packaging.ZipPackage..ctor(Stream s, FileMode packageFileMode, FileAccess packageFileAccess)
   at System.IO.Packaging.Package.Open(Stream stream, FileMode packageMode, FileAccess packageAccess)
   at DocumentFormat.OpenXml.Packaging.OpenXmlPackage.OpenCore(Stream stream, Boolean readWriteMode)
   at DocumentFormat.OpenXml.Packaging.SpreadsheetDocument.Open(Stream stream, Boolean isEditable, OpenSettings openSettings)
   at...


When I open the PPTX package, rename oleObject1.bin in the embeddings folder to oleObject1.zip, and inspect the file information in WinRAR, it is reported as an SFX ZIP volume rather than a plain ZIP archive.

The only way I could get SpreadsheetDocument to open the embedded object stream was to convert the stream to a System.IO.Compression.ZipArchive using the DotNetZip library.

So I have the following questions:


  1. Is there a way to get the Open XML SDK to open an embedded Excel worksheet stream without explicit transcoding (from SFX Zip volume to Zip Archive)?

  2. What is the best way to write the modified stream back into the presentation document? This is important because the worksheet data will be updated and has to be written back to the host document.

  3. Is there another, more elegant way to solve this issue?



Note: this issue does not occur when the worksheet is embedded in the presentation programmatically using the Open XML SDK.

Answer

I finally figured out that, although a tool like WinRAR reports the embedded object as an SFX ZIP volume, it is actually an MS-CFB (Compound File Binary) file.
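
One way to tell the two cases apart before handing a stream to SpreadsheetDocument.Open is to look at the first few bytes. The sketch below compares them against the MS-CFB header signature and the ZIP local file header signature. The type and method names (EmbeddedKind, EmbeddedObjectSniffer, Detect) are made up for illustration and are not part of the Open XML SDK.

```csharp
using System;
using System.IO;

public enum EmbeddedKind { Zip, CompoundFile, Unknown }

public static class EmbeddedObjectSniffer
{
    // MS-CFB header signature: the first 8 bytes of every compound file.
    private static readonly byte[] CfbSignature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // ZIP local file header signature ("PK\x03\x04"), which starts a
    // ZIP archive that contains at least one entry.
    private static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    // Reads up to 8 bytes from the current position and classifies them.
    // The stream is advanced, so open a fresh stream for further processing.
    public static EmbeddedKind Detect(Stream stream)
    {
        var header = new byte[8];
        int read = 0;

        // Stream.Read may return fewer bytes than requested, so loop.
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }

        if (StartsWith(header, read, CfbSignature)) return EmbeddedKind.CompoundFile;
        if (StartsWith(header, read, ZipSignature)) return EmbeddedKind.Zip;
        return EmbeddedKind.Unknown;
    }

    private static bool StartsWith(byte[] buffer, int length, byte[] signature)
    {
        if (length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (buffer[i] != signature[i]) return false;
        }
        return true;
    }
}
```

In the loop from the question this could be called as EmbeddedObjectSniffer.Detect(eoPart.GetStream()), passing the stream straight to SpreadsheetDocument.Open only when the result is EmbeddedKind.Zip.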

You can work with CFB files in the following ways:

  1. Windows API: ole32.dll provides methods to read and write CFB files. I found this excellent article on this topic.
  2. There are some useful resources on this page that refer to some open source options.

Bottom line: Office documents embedded in other Office documents as embedded objects are saved in MS-CFB format. Reading and writing these files has to be done outside the Open XML SDK, using either the Windows API or some other alternative.
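
As a rough illustration of the open source route, here is a minimal sketch of a read, modify, write-back round trip. It rests on several assumptions that the original answer does not confirm: it uses the OpenMcdf library (2.x API) as the CFB reader, it assumes the embedded workbook is stored in a CFB stream named "Package", and the class and method names (EmbeddedWorkbookEditor, EditEmbeddedWorkbook) are hypothetical. Check the actual stream names in your files before relying on it.

```csharp
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenMcdf; // assumption: OpenMcdf 2.x NuGet package

public static class EmbeddedWorkbookEditor
{
    // Assumption: the .xlsx bytes live in a CFB stream with this name.
    private const string PackageStreamName = "Package";

    public static void EditEmbeddedWorkbook(EmbeddedObjectPart eoPart)
    {
        // Buffer the part so the CFB reader gets a seekable stream.
        var cfbInput = new MemoryStream();
        using (Stream partStream = eoPart.GetStream())
        {
            partStream.CopyTo(cfbInput);
        }
        cfbInput.Position = 0;

        byte[] updatedCfb;
        using (var cfb = new CompoundFile(cfbInput))
        {
            CFStream package = cfb.RootStorage.GetStream(PackageStreamName);
            byte[] data = package.GetData();

            // Use an expandable MemoryStream; new MemoryStream(byte[])
            // cannot grow when the workbook is saved.
            using (var xlsx = new MemoryStream())
            {
                xlsx.Write(data, 0, data.Length);
                xlsx.Position = 0;

                using (SpreadsheetDocument sd = SpreadsheetDocument.Open(xlsx, true))
                {
                    // do some work with worksheets
                    var count = sd.WorkbookPart.WorksheetParts.Count();
                } // disposing the document flushes changes into xlsx

                package.SetData(xlsx.ToArray());
            }

            // Serialize the modified compound file to memory.
            using (var output = new MemoryStream())
            {
                cfb.Save(output);
                updatedCfb = output.ToArray();
            }
        }

        // Replace the content of the embedded object part in the presentation.
        using (var replacement = new MemoryStream(updatedCfb))
        {
            eoPart.FeedData(replacement);
        }
    }
}
```

The method would be called from inside the inner foreach loop of the question in place of the direct SpreadsheetDocument.Open call. FeedData is the Open XML SDK call that overwrites a part's content, which is one possible answer to question 2; the presentation must be opened as editable for it to persist.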