Matches docx/xlsx as an `application/zip`
DarthPestilane opened this issue · comments
ft, _ := filetype.Match(buf)
`buf` is read from a docx or xlsx file, but `ft.MIME.Value` is `application/zip`.
Hi all!
Upload file .xlsx
...
file, _, _ := r.FormFile("uploadFile")
defer file.Close()
fileBytes, _ := ioutil.ReadAll(file)
k, _ := filetype.Match(fileBytes)
log.Println(k.Extension, k.MIME)
...
Result
zip {application zip application/zip}
The xlsx matcher just looks like this:
func Xlsx(buf []byte) bool {
	return len(buf) > 3 &&
		buf[0] == 0x50 && buf[1] == 0x4B &&
		buf[2] == 0x03 && buf[3] == 0x04 &&
		bytes.Contains(buf[:256], []byte(TypeXlsx.MIME.Value))
}
But an xlsx file is itself a zip archive, so we must unzip the raw content and check the "[Content_Types].xml" file inside it.
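A minimal sketch of that idea, using only the Go standard library. The helper names `detectOOXML` and `makeZip` are hypothetical and not part of the filetype package, and the sketch assumes the whole file is available in memory rather than just a header buffer. It reads "[Content_Types].xml" and looks for the main-part content type strings used by xlsx and docx files.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// detectOOXML is a hypothetical helper, not part of the filetype library.
// It opens buf as a zip archive, reads "[Content_Types].xml", and returns
// "xlsx", "docx", or "zip" depending on the content types declared there.
func detectOOXML(buf []byte) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(buf), int64(len(buf)))
	if err != nil {
		return "", err // not a readable zip archive
	}
	for _, f := range zr.File {
		if f.Name != "[Content_Types].xml" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		data, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return "", err
		}
		s := string(data)
		switch {
		case strings.Contains(s, "spreadsheetml.sheet.main+xml"):
			return "xlsx", nil
		case strings.Contains(s, "wordprocessingml.document.main+xml"):
			return "docx", nil
		}
		return "zip", nil
	}
	// No content-types part found: treat it as a plain zip archive.
	return "zip", nil
}

// makeZip builds a small in-memory zip archive for the demo below.
func makeZip(files map[string]string) []byte {
	var b bytes.Buffer
	zw := zip.NewWriter(&b)
	for name, body := range files {
		w, err := zw.Create(name)
		if err != nil {
			panic(err)
		}
		if _, err := io.WriteString(w, body); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return b.Bytes()
}

func main() {
	sample := makeZip(map[string]string{
		"[Content_Types].xml": `<Override ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>`,
	})
	kind, err := detectOOXML(sample)
	fmt.Println(kind, err)
}
```

The trade-off is that this needs the complete archive (the zip central directory sits at the end of the file), so it cannot work on a short header buffer the way the magic-number matchers do.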
The main issue here is that the matchers are stored in a map, whose iteration order is non-deterministic. Even with xlsx support in a matcher, iterating over the map might hit the check for `zip` first (which the file is) and treat that as a successful match, rather than cascading to recognize that something can be both `zip` and `xlsx` (or `docx`, etc.).
I get the problem with the map being unordered, but it should be changed to an array of structs so the order could be defined. The order is important; otherwise ".xlsx", ".docx", etc., as well as "epub" files, will be matched as "application/zip", because zip is a superset of several other types (many types are just zip files with different extensions).
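As a rough illustration of the ordered approach, here is a sketch with a slice of structs where specific matchers come before the generic zip one. The `matcher` struct, `matchOrdered`, and the matcher functions are hypothetical names for illustration, not the library's actual types. The epub check assumes the common layout where the first zip entry is an uncompressed "mimetype" file with no extra field.

```go
package main

import (
	"bytes"
	"fmt"
)

// matcher pairs a type name with a detection function.
// Hypothetical type, for illustration only.
type matcher struct {
	name  string
	match func(buf []byte) bool
}

var zipMagic = []byte{0x50, 0x4B, 0x03, 0x04}

func isZip(buf []byte) bool { return bytes.HasPrefix(buf, zipMagic) }

// isEpub assumes the first entry is an uncompressed "mimetype" file, so the
// entry name and its contents appear right after the 30-byte local header.
func isEpub(buf []byte) bool {
	return isZip(buf) && len(buf) >= 58 &&
		bytes.Equal(buf[30:58], []byte("mimetypeapplication/epub+zip"))
}

// ordered lists the more specific type first, so it wins over plain zip.
var ordered = []matcher{
	{"epub", isEpub},
	{"zip", isZip},
}

// matchOrdered returns the name of the first matcher that accepts buf.
func matchOrdered(buf []byte) string {
	for _, m := range ordered {
		if m.match(buf) {
			return m.name
		}
	}
	return "unknown"
}

func main() {
	// Fake epub header: magic, 26 remaining header bytes, then name + body.
	sample := append([]byte{}, zipMagic...)
	sample = append(sample, make([]byte, 26)...)
	sample = append(sample, []byte("mimetypeapplication/epub+zip")...)
	fmt.Println(matchOrdered(sample))
}
```

Because a slice is iterated in index order, the result no longer depends on map iteration order; the cost is that every new matcher has to be placed deliberately relative to the types it overlaps with.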
I was coming here to report an issue with `epub` files but I saw this equivalent issue. Do you need help implementing this?
I just saw the PR by @ex-nerd. Adding parent types would be a really good solution, because it would be a better DX than setting matcher priorities and wouldn't be such a breaking change.
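To make the parent-type idea concrete, here is one way it could look. This is a sketch under my own assumptions, not the code from the PR: `typeNode` and `resolve` are hypothetical names. A parent type is matched first, and then its children get a chance to refine the result.

```go
package main

import (
	"bytes"
	"fmt"
)

// typeNode is a hypothetical tree node: a type plus the more specific
// types that share its container format.
type typeNode struct {
	mime     string
	match    func(buf []byte) bool
	children []*typeNode
}

// resolve returns the most specific MIME type in the tree that matches buf.
// Children are only tried once the parent has matched.
func (n *typeNode) resolve(buf []byte) (string, bool) {
	if !n.match(buf) {
		return "", false
	}
	for _, c := range n.children {
		if mime, ok := c.resolve(buf); ok {
			return mime, true
		}
	}
	return n.mime, true
}

var zipTree = &typeNode{
	mime: "application/zip",
	match: func(buf []byte) bool {
		return bytes.HasPrefix(buf, []byte("PK\x03\x04"))
	},
	children: []*typeNode{
		{
			mime: "application/epub+zip",
			// Assumes an uncompressed "mimetype" first entry with no extra field.
			match: func(buf []byte) bool {
				return len(buf) >= 58 &&
					bytes.Equal(buf[30:58], []byte("mimetypeapplication/epub+zip"))
			},
		},
	},
}

func main() {
	plain := []byte("PK\x03\x04 some other archive content")
	mime, ok := zipTree.resolve(plain)
	fmt.Println(mime, ok)
}
```

With this shape, a file that matches only the parent still resolves to `application/zip`, while callers never have to reason about a global matcher order.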
@gotoxu That makes sense. I don't know if the part of the header that contains the Excel MIME type would always stay uncompressed. Is that guaranteed?
> I get the problem with the map being unordered but it should be changed to an array of structs so the order could be defined
The order should be less of a concern now in the world of Python 3 (or it could be changed to an `OrderedDict` and still be compatible with legacy Python 2 code while retaining order). However, you still run into problems with "nested" types like zip, where you need to do some additional processing and extract part of the file to determine whether a file is a plain archive, docx, jar, etc.
This is Go, not Python haha
> This is Go, not Python haha
Geesch, no wonder I couldn't find which library I ended up using when I was looking for the reply here (it's been too many years and I forgot which project I was researching these issues for).