neilharvey / FileSignatures

A small library for detecting the type of a file based on header signature (also known as magic number).

Support for non-standard Docx files

anistrate opened this issue

I recently came across a Word document that the library failed to recognize as such; it reported the file as a zip.
On first inspection the document seemed normal: I could open it and the content was there.
But when I opened the archive, I found that it had "xl\workbook2.xml" instead of the expected workbook.xml.
Yet Word opened it without a problem, and a Google search turned up these threads:
open-xml-templating/docxtemplater#366
ruby-docx/docx#83
https://social.msdn.microsoft.com/Forums/office/en-US/9777c217-800d-4cfd-bc6f-8bf8f11dba3a/how-to-change-mainpart-name-?forum=oxmlsdk

They suggest that many frameworks and libraries output such para-standard documents.
Should these files also be detected as Word files?
Nice work with the library by the way, I’ve used it and it’s a great resource.

Hey, thanks :)

That's actually really interesting, I wasn't aware such discrepancies existed. I've done some quick research and it looks like what we should be doing is looking at [Content_Types].xml in the root of the archive and retrieving the part name from there (rather than assuming it's constant).

For example, a standard Excel document would have an entry like the following:
<Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>

Could you check your sample which contains xl\workbook2.xml and see if the path is specified in [Content_Types].xml? That should confirm whether this is the right change or not.
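
The lookup described above could be sketched roughly as follows. This is only an illustration, not code from FileSignatures: the `ContentTypesReader` class and `FindPartName` method are hypothetical names, and the sketch assumes the package stores its content types in a root entry named `[Content_Types].xml` using the standard Open Packaging Conventions namespace.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of FileSignatures.
public static class ContentTypesReader
{
    // XML namespace used by [Content_Types].xml in Open Packaging Conventions packages.
    private static readonly XNamespace Ns =
        "http://schemas.openxmlformats.org/package/2006/content-types";

    // Returns the PartName of the first Override with the given content type,
    // or null when no such Override exists.
    public static string FindPartName(Stream contentTypesXml, string contentType)
    {
        var doc = XDocument.Load(contentTypesXml);
        return doc.Descendants(Ns + "Override")
            .Where(o => string.Equals(
                (string)o.Attribute("ContentType"),
                contentType,
                StringComparison.OrdinalIgnoreCase))
            .Select(o => (string)o.Attribute("PartName"))
            .FirstOrDefault();
    }

    // Convenience overload: opens [Content_Types].xml inside the archive first.
    public static string FindPartName(ZipArchive archive, string contentType)
    {
        var entry = archive.GetEntry("[Content_Types].xml");
        if (entry == null)
        {
            return null; // not an Open XML package
        }

        using (var stream = entry.Open())
        {
            return FindPartName(stream, contentType);
        }
    }
}
```

Note that the archive overload has to decompress and parse an entry, which costs more than just listing entry names.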

Having thought about it, extracting a file from the archive could heavily degrade performance (currently we only scan the entry names, which is quick). A more performant fix might be to alter the search to allow for numbered variations of the main entry, which should catch more of these edge cases.

The sample does contain the following line in [Content_Types].xml:

<Override PartName="/word/document2.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" />

Rather than searching for a specific name, i.e. "document.xml", would it make sense to match a pattern, i.e. "document*.xml" for Word and "workbook*.xml" for Excel?
This would not cover all cases (after all, someone could replace "/word/document2.xml" with any file name), but it might cover documents generated by frameworks (of dubious quality).

I can make a pull request with these changes.
Also, given that Office Open XML files are required to have a [Content_Types].xml, would it make sense to check for that as well? This would prevent simple archives that happen to contain a document.xml file somewhere from being classified as Word.

Yeah, I think a simple implementation would be to change OfficeOpenXml.cs so that it splits the identifiableEntry into private fields containing the entry without the extension and the extension, then we could update IsMatch so that it does something like:

return archive.Entries.Any(e => e.FullName.StartsWith(identifiableEntryWithoutExtension, StringComparison.OrdinalIgnoreCase) && e.FullName.EndsWith(identifiableEntryExtension, StringComparison.OrdinalIgnoreCase));

That should effectively do the wildcard search. I think checking for [Content_Types].xml as well sounds like a sensible idea.
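
Putting the two ideas together, a minimal sketch might look like the following. It is an assumption-laden illustration, not the library's actual implementation: `OpenXmlEntryMatcher` and its `IsMatch` signature are hypothetical, and it works on plain entry names so that it stays independent of how the archive is opened.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration; not the actual OfficeOpenXml.cs code.
public static class OpenXmlEntryMatcher
{
    private const string ContentTypesEntry = "[Content_Types].xml";

    // True when the entry list contains [Content_Types].xml and an entry that
    // starts with the identifiable entry's name and ends with its extension,
    // e.g. "word/document.xml" also accepts "word/document2.xml".
    public static bool IsMatch(IEnumerable<string> entryNames, string identifiableEntry)
    {
        var names = entryNames.ToList();

        // Split "word/document.xml" into "word/document" and ".xml".
        var extension = Path.GetExtension(identifiableEntry);
        var prefix = identifiableEntry.Substring(
            0, identifiableEntry.Length - extension.Length);

        var hasContentTypes = names.Any(n =>
            n.Equals(ContentTypesEntry, StringComparison.OrdinalIgnoreCase));

        var hasMainPart = names.Any(n =>
            n.StartsWith(prefix, StringComparison.OrdinalIgnoreCase) &&
            n.EndsWith(extension, StringComparison.OrdinalIgnoreCase));

        return hasContentTypes && hasMainPart;
    }
}
```

With a `ZipArchive`, the names could be supplied as `archive.Entries.Select(e => e.FullName)`, so only the entry list is scanned and nothing is extracted.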

Feel free to send me a PR, otherwise I can have a look at it later :)

Just to explain my thinking: I'd rather do it this way than just change the entries in the subclasses because I'd prefer to avoid a breaking change if at all possible!

I've pushed a new release which should fix this :)

That fixed it. Thanks a lot!