neilharvey / FileSignatures

A small library for detecting the type of a file based on header signature (also known as magic number).

Linux | XLSX

muralimaddu opened this issue · comments

I'm using this library without problems in our Windows dev environment: it recognizes xlsx files as Excel files (media type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet). In our hosting environment, which runs Linux, .xlsx (Excel) files are treated as zip files instead (media type: application/zip). I have read that ***x files are treated as zip files in Linux. In our use case we check that an uploaded file's extension matches its file data, so when we upload somesample.xlsx from a Windows machine to the app hosted on Linux, the library reports it as zip rather than xlsx and the check fails.

Can anyone suggest a solution to make these checks platform-agnostic?

The Excel xlsx format is actually a zip file; you can decompress it and you will get a directory structure containing several XML files. This isn't specific to Linux though; it works the same on Windows as well.

For a minimal workbook, there should always be a workbook.xml file in the xl folder of the decompressed xlsx file (and a sheet in xl/worksheets). So, how we detect an xlsx file from a stream is as follows:

  1. Does it have a zip header (50 4B 03 04)?
  2. If so, does the zip file contain an entry for xl/workbook.xml?

Since the library is returning that they are zip files, it implies that the first check passes but the second fails.
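The two checks above can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation: `XlsxProbe` and `Classify` are hypothetical names, and the sketch simply reports "zip" whenever the entries cannot be read.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper for illustration only; not part of FileSignatures.
public static class XlsxProbe
{
    // Zip local file header signature: 50 4B 03 04.
    private static readonly byte[] ZipHeader = { 0x50, 0x4B, 0x03, 0x04 };

    public static string Classify(Stream stream)
    {
        // Check 1: do the first four bytes match the zip header?
        var header = new byte[4];
        stream.Position = 0;
        int total = 0, n;
        while (total < 4 && (n = stream.Read(header, total, 4 - total)) > 0)
        {
            total += n;
        }
        if (total < 4 || !header.SequenceEqual(ZipHeader))
        {
            return "unknown";
        }

        // Check 2: does the archive contain an xl/workbook.xml entry?
        stream.Position = 0;
        try
        {
            using var zip = new ZipArchive(stream, ZipArchiveMode.Read, leaveOpen: true);
            return zip.GetEntry("xl/workbook.xml") != null ? "xlsx" : "zip";
        }
        catch (InvalidDataException)
        {
            // Assumption for this sketch: unreadable entries fall back to plain zip.
            return "zip";
        }
    }
}
```

With this shape, a file whose header is valid but whose entries cannot be enumerated ends up classified as zip, which matches the symptom described in this issue.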

If you run unzip -l against a sample xlsx file, does it contain xl/workbook.xml? I would expect something like this:

unzip -l Test.xlsx
Archive:  Test.xlsx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1032  01/01/1980 00:00   [Content_Types].xml
      588  01/01/1980 00:00   _rels/.rels
     2125  01/01/1980 00:00   xl/workbook.xml
      557  01/01/1980 00:00   xl/_rels/workbook.xml.rels
      933  01/01/1980 00:00   xl/worksheets/sheet1.xml
     8390  01/01/1980 00:00   xl/theme/theme1.xml
     1618  01/01/1980 00:00   xl/styles.xml
      615  01/01/1980 00:00   docProps/core.xml
      785  01/01/1980 00:00   docProps/app.xml
---------                     -------
    16643                     9 files

Thanks Neil for quick reply.

I checked and can see xl/workbook.xml when I unzip my sample file (btw we are observing same behavior for docx files as well).

Is it possible to find out why it's failing at the second step? To emulate Linux, I ran my application locally in Docker Desktop in Linux mode (WSL 2), and there it identified the file as an Excel xlsx file.

Docker would have been my first attempt too. I've no idea why it would fail on a Linux server but work on Docker. Current suspects are that either the entries are not in the format I'm expecting (e.g. perhaps the copy from Windows -> Linux is mangling the paths) or that the call to System.IO.Compression.ZipArchive fails on Linux so we can't read the entries at all.

I've created a quick hack to dump the detected format and zip entries; I'll try provisioning a Linux VM and see if it fails.

I've tested this on an Ubuntu 20.04 LTS VM and it seems to work as expected:

user@vm-ubunutu:~/OfficeFileProbe$ dotnet run ~/OfficeFileProbe/Test.xlsx
FileSignatures detected format as:
xlsx [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]

Zip contains 10 entries:
xl/drawings/drawing1.xml
xl/worksheets/sheet1.xml
xl/worksheets/_rels/sheet1.xml.rels
xl/theme/theme1.xml
xl/sharedStrings.xml
xl/styles.xml
xl/workbook.xml
xl/_rels/workbook.xml.rels
_rels/.rels
[Content_Types].xml

Are you able to test on your environment and see if you get different results?

Hi Neil,

Sorry for the delay in responding. I checked with my devops team and the containers run Debian GNU/Linux 11. I deployed the hack mentioned above to print the zip archive entries, and when I tried to upload an xlsx file it went to the catch block with the error "Unable to read zip entries: Offset to Central Directory cannot be held in an Int64." Maybe that is the reason it concludes the file is a zip file?

try
{
    using (var zip = new ZipArchive(data))
    {
        _logger.LogInformation($"Zip contains {zip.Entries.Count} entries:");
        foreach (var entry in zip.Entries)
        {
            _logger.LogInformation(entry.FullName);
        }
    }
}
catch (Exception ex)
{
    _logger.LogInformation($"Unable to read zip entries: {ex.Message}");
}

Okay, yes that would explain why it's not being detected as an xlsx file. Internally we use the ZipArchive class to read the zip entries and look for the item we are interested in. If it fails with an InvalidDataException we return null, the assumption being that it indicates that the file format is invalid.

ZipArchive is part of System.IO.Compression. The source of the exception message is here, and it is used in two places.

The first is in ReadEndOfCentralDirectory:

// ZipArchive.cs#L560
if (_centralDirectoryStart > _archiveStream.Length)
{
    throw new InvalidDataException(SR.FieldTooBigOffsetToCD);
}

This error is thrown when the offset to central directory record value in the EOCD contains a value which is larger than the length of the stream. It suggests that the EOCD is invalid.
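One way to probe this case on the failing host is to read the offset straight out of the EOCD and compare it with the stream length. The sketch below is a hedged diagnostic, not code from the library or from System.IO.Compression: `EocdProbe` and `ReadCentralDirectoryOffset` are hypothetical names. It relies on the zip format layout, in which the EOCD starts with the signature 50 4B 05 06, is at least 22 bytes long, and stores the central directory offset as a little-endian 32-bit value at byte 16.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical diagnostic helper; not part of FileSignatures or the BCL.
public static class EocdProbe
{
    private const int MinEocdSize = 22;                 // EOCD without a comment
    private const int MaxCommentSize = ushort.MaxValue; // comment length is a 16-bit field

    // Returns the central directory offset stored in the EOCD,
    // or null when no EOCD signature is found in the tail of the stream.
    public static uint? ReadCentralDirectoryOffset(Stream stream)
    {
        if (!stream.CanSeek || stream.Length < MinEocdSize)
        {
            return null;
        }

        // The EOCD lives within the last 22 + 65535 bytes of the archive.
        int tailSize = (int)Math.Min(stream.Length, MinEocdSize + MaxCommentSize);
        var tail = new byte[tailSize];
        stream.Position = stream.Length - tailSize;
        int total = 0;
        while (total < tailSize)
        {
            int n = stream.Read(tail, total, tailSize - total);
            if (n == 0)
            {
                return null; // stream ended earlier than its reported length
            }
            total += n;
        }

        // Scan backwards for the EOCD signature 50 4B 05 06.
        for (int i = tailSize - MinEocdSize; i >= 0; i--)
        {
            if (tail[i] == 0x50 && tail[i + 1] == 0x4B && tail[i + 2] == 0x05 && tail[i + 3] == 0x06)
            {
                return BinaryPrimitives.ReadUInt32LittleEndian(tail.AsSpan(i + 16, 4));
            }
        }
        return null;
    }
}
```

Logging the returned value next to `stream.Length` on both the working and the failing environment would show whether the stream seen by ZipArchive is shorter than the offset it is asked to seek to. A value of 0xFFFFFFFF means the real offset is stored in the Zip64 record instead.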

I originally thought the stream might be reporting its length as zero, but our code returns a null value when the length is zero; since you are receiving zip, it can't be going down that path.

The second case is during TryReadZip64EndOfCentralDirectory:

// ZipArchive#622
if (record.OffsetOfCentralDirectory > long.MaxValue)
   throw new InvalidDataException(SR.FieldTooBigOffsetToCD);

This is only used when the zip is in Zip-64 format. This format allows for Excel files larger than 4GB in size, although it appears that some libraries let you specify this mode when creating an Excel file programmatically. The cause of the exception appears to be the same - the EOCD contains an invalid central directory offset.
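To tell which of the two paths a given file might take, a rough check is to look for the Zip64 EOCD locator (signature 50 4B 06 07, 20 bytes long), which sits directly before the EOCD in a Zip-64 archive. The sketch below is illustrative only: `Zip64Probe` is a hypothetical name, and it assumes the archive has no trailing comment, so that the EOCD occupies exactly the last 22 bytes.

```csharp
using System;
using System.IO;

// Hypothetical diagnostic helper; assumes the archive has no trailing comment.
public static class Zip64Probe
{
    private const int EocdSize = 22;    // EOCD without a comment
    private const int LocatorSize = 20; // Zip64 EOCD locator

    public static bool HasZip64Locator(Stream stream)
    {
        if (!stream.CanSeek || stream.Length < EocdSize + LocatorSize)
        {
            return false;
        }

        // Read the last 42 bytes: locator (if present) followed by the EOCD.
        var buffer = new byte[EocdSize + LocatorSize];
        stream.Position = stream.Length - buffer.Length;
        int total = 0, n;
        while (total < buffer.Length && (n = stream.Read(buffer, total, buffer.Length - total)) > 0)
        {
            total += n;
        }
        if (total < buffer.Length)
        {
            return false;
        }

        // EOCD signature 50 4B 05 06 must start at the final 22 bytes.
        bool eocdAtEnd = buffer[20] == 0x50 && buffer[21] == 0x4B
                      && buffer[22] == 0x05 && buffer[23] == 0x06;
        // Zip64 EOCD locator signature 50 4B 06 07 directly precedes it.
        bool locator = buffer[0] == 0x50 && buffer[1] == 0x4B
                    && buffer[2] == 0x06 && buffer[3] == 0x07;
        return eocdAtEnd && locator;
    }
}
```

If this returns false for the uploaded files, the Zip-64 path is unlikely to be involved and the first check becomes the more probable source of the exception.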

I don't see anything Linux-specific in the source code; at the moment it looks like an issue with the format of the Excel files. Any idea how the files are being created? Are they being created via Office, or are they being generated from another system?

Are you able to get hold of one of the files that was uploaded to production to test on your machine and see if the issue can be replicated?

Yes, I got the file, and on my machine it works fine, but it fails in the hosting environment with the aforementioned error. I also tried uploading xlsx and docx files created via Office, and it's the same problem.

I've created a Debian 11 VM in Azure and re-run the test, and it succeeds:

azureuser@vm-debian:~/OfficeFileProbe$ dotnet run Test.xlsx
FileSignatures detected format as:
xlsx [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]

Perhaps I can introduce some more diagnostics to probe at the root cause.