Support for non-standard Docx files

Question

anistrate opened this issue 2 years ago

I recently came across a Word document that the library failed to recognize as such, reporting it as a zip instead.
On first inspection the document seemed normal: I could open it and the content was there.
But when opening the archive, I found that it had "xl\workbook2.xml" instead of the expected workbook.xml.
Yet Word opened it without a problem and a Google search:
open-xml-templating/docxtemplater#366
ruby-docx/docx#83
https://social.msdn.microsoft.com/Forums/office/en-US/9777c217-800d-4cfd-bc6f-8bf8f11dba3a/how-to-change-mainpart-name-?forum=oxmlsdk

suggests that many frameworks and libraries will output such para-standard documents.
Should they also be included as Word files?
Nice work with the library by the way, I’ve used it and it’s a great resource.

Neil Harvey · Answer 1 · Wed May 11 2022 15:32:47 GMT+0800 (China Standard Time)

Hey, thanks :)

That's actually really interesting; I wasn't aware such discrepancies existed. I've done some quick research, and it looks like what we should be doing is looking at [Content_Types].xml in the root of the archive and retrieving the part name from there (rather than assuming it's constant).

For example, a standard Excel document would have an entry like the following:
<Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>

Could you check your sample which contains xl\workbook2.xml and see if the path is specified in [Content_Types].xml? That should confirm whether this is the right change or not.
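As a rough illustration of the lookup described above, here is a minimal sketch that opens [Content_Types].xml and returns the part name registered for a given content type. The class and method names (ContentTypesReader, FindPartName) are hypothetical and are not part of the library; the sketch only considers Override elements and assumes the standard OPC content-types namespace.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the library discussed in this thread.
public static class ContentTypesReader
{
    // Namespace used by [Content_Types].xml in Open Packaging Conventions packages.
    private static readonly XNamespace Ns =
        "http://schemas.openxmlformats.org/package/2006/content-types";

    // Returns the PartName of the first Override whose ContentType matches,
    // or null if the archive has no [Content_Types].xml or no matching entry.
    public static string FindPartName(Stream zipStream, string contentType)
    {
        using var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, leaveOpen: true);
        var entry = archive.GetEntry("[Content_Types].xml");
        if (entry == null)
        {
            return null;
        }

        // This is the step that requires decompressing an entry,
        // unlike a scan of the entry names alone.
        using var entryStream = entry.Open();
        var doc = XDocument.Load(entryStream);

        return doc.Root?
            .Elements(Ns + "Override")
            .Where(o => string.Equals(
                (string)o.Attribute("ContentType"),
                contentType,
                StringComparison.OrdinalIgnoreCase))
            .Select(o => (string)o.Attribute("PartName"))
            .FirstOrDefault();
    }
}
```

Calling FindPartName with the spreadsheetml content type shown above would, for a standard workbook, be expected to yield /xl/workbook.xml. Note that this reads and parses an entry rather than only listing names, so it costs more than a pure entry scan.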

Neil Harvey · Answer 2 · Wed May 11 2022 15:56:33 GMT+0800 (China Standard Time)

Having thought about it, if we need to extract a file from the archive it could heavily degrade performance (currently we only scan the entries, which is quick). A more performant fix might be to alter the search to allow for numbered variations of the main entry, which should catch more of these edge cases.

anistrate · Answer 3 · Wed May 11 2022 22:16:35 GMT+0800 (China Standard Time)

The sample does contain the following line in [Content_Types].xml:

<Override PartName="/word/document2.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" />

Would it make sense, rather than searching for a specific name (i.e. "document.xml"), to match a certain pattern, i.e. "document*.xml" for Word and "workbook*.xml" for Excel?
This would not cover all cases; after all, someone could override "/word/document2.xml" with any file name. But it might cover the case of documents generated by frameworks (of dubious quality).
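To make the pattern idea concrete, here is a small sketch that tests entry names against "document*.xml" and "workbook*.xml" expressed as regular expressions. The class and method names are made up for illustration, and the "word/" and "xl/" folder prefixes are an assumption based on the standard layout.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class MainPartPattern
{
    // "document*.xml" inside word/, where * is any run of characters without a slash.
    private static readonly Regex WordMainPart = new Regex(
        @"^word/document[^/]*\.xml$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // "workbook*.xml" inside xl/.
    private static readonly Regex ExcelMainPart = new Regex(
        @"^xl/workbook[^/]*\.xml$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static bool LooksLikeWord(IEnumerable<string> entryNames) =>
        entryNames.Any(n => WordMainPart.IsMatch(Normalize(n)));

    public static bool LooksLikeExcel(IEnumerable<string> entryNames) =>
        entryNames.Any(n => ExcelMainPart.IsMatch(Normalize(n)));

    // Some tools display or write backslashes; compare using forward slashes.
    private static string Normalize(string name) => name.Replace('\\', '/');
}
```

With this sketch, both "word/document.xml" and "word/document2.xml" match the Word pattern, while a renamed main part such as "word/main.xml" does not, which is the limitation mentioned above.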

anistrate · Answer 4 · Wed May 11 2022 22:21:15 GMT+0800 (China Standard Time)

I can make a pull request with these changes.
Also, given that OfficeOpenXml files are required to have a [Content_Types].xml, would it make sense to have this check as well? This would prevent simple archives with a document.xml file somewhere from being classified as Word.

Neil Harvey · Answer 5 · Wed May 11 2022 22:45:45 GMT+0800 (China Standard Time)

Yeah, I think a simple implementation would be to change OfficeOpenXml.cs so that it splits the identifiableEntry into private fields containing the entry without the extension and the extension, then we could update IsMatch so that it does something like:

return archive.Entries.Any(e => e.FullName.StartsWith(identifiableEntryWithoutExtension, StringComparison.OrdinalIgnoreCase) && e.FullName.EndsWith(identifiableEntryExtension, StringComparison.OrdinalIgnoreCase));

That should effectively do the wildcard search. I think checking for [Content_Types].xml as well sounds like a sensible idea.
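Putting the two ideas together, a self-contained sketch might look like the following. This is not the actual OfficeOpenXml.cs; the class name WildcardEntryMatcher and its fields are hypothetical, and only the StartsWith/EndsWith comparison is taken from the snippet above.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical stand-in for the matching logic, not the library's real class.
public sealed class WildcardEntryMatcher
{
    private const string ContentTypesEntry = "[Content_Types].xml";

    private readonly string prefix;     // e.g. "word/document"
    private readonly string extension;  // e.g. ".xml"

    public WildcardEntryMatcher(string identifiableEntry)
    {
        // Split "word/document.xml" into "word/document" and ".xml".
        extension = Path.GetExtension(identifiableEntry);
        prefix = identifiableEntry.Substring(0, identifiableEntry.Length - extension.Length);
    }

    public bool IsMatch(ZipArchive archive)
    {
        // Only entry names are inspected, so nothing is decompressed.
        bool hasContentTypes = archive.Entries.Any(e =>
            e.FullName.Equals(ContentTypesEntry, StringComparison.OrdinalIgnoreCase));

        return hasContentTypes && archive.Entries.Any(e =>
            e.FullName.StartsWith(prefix, StringComparison.OrdinalIgnoreCase) &&
            e.FullName.EndsWith(extension, StringComparison.OrdinalIgnoreCase));
    }
}
```

Constructing it with "word/document.xml" accepts an archive containing [Content_Types].xml and word/document2.xml, and rejects a plain zip that merely holds a word/document.xml. One caveat of the prefix/suffix approach is that it is looser than a strict "numbered variation" check: any entry that starts with the prefix and ends with the extension will match.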

Feel free to send me a PR, otherwise I can have a look at it later :)

Neil Harvey · Answer 6 · Wed May 11 2022 22:59:52 GMT+0800 (China Standard Time)

Just to explain my thinking - the reason I'm thinking along the lines of doing it this way rather than just changing the entries in the subclasses is that I'd prefer to avoid a breaking change if at all possible!

Neil Harvey · Answer 7 · Mon May 16 2022 03:40:06 GMT+0800 (China Standard Time)

I've pushed a new release which should fix this :)

anistrate · Answer 8 · Thu May 19 2022 20:01:03 GMT+0800 (China Standard Time)

That fixed it. Thanks a lot!