h2non / filetype

Fast, dependency-free Go package to infer binary file types based on the magic numbers header signature

Home Page: https://pkg.go.dev/github.com/h2non/filetype?tab=doc

Matches docx/xlsx as an `application/zip`

DarthPestilane opened this issue:

ft, _ := filetype.Match(buf)

`buf` is read from a docx or xlsx file, but `ft.MIME.Value` is `application/zip`.

Hi all!
I get this when uploading an .xlsx file:

...
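	file, _, err := r.FormFile("uploadFile")
	if err != nil {
		log.Println(err)
		return
	}
	defer file.Close()
	fileBytes, err := io.ReadAll(file) // io.ReadAll replaces the deprecated ioutil.ReadAll
	if err != nil {
		log.Println(err)
		return
	}
	k, _ := filetype.Match(fileBytes)
	log.Println(k.Extension, k.MIME)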
...

Result
zip {application zip application/zip}

The xlsx matcher just looks like this:

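func Xlsx(buf []byte) bool {
	// Limit the search to the first 256 bytes without slicing past len(buf).
	head := buf
	if len(head) > 256 {
		head = head[:256]
	}
	return len(buf) > 3 &&
		buf[0] == 0x50 && buf[1] == 0x4B && // "PK"
		buf[2] == 0x03 && buf[3] == 0x04 && // zip local file header signature
		bytes.Contains(head, []byte(TypeXlsx.MIME.Value))
}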

But the raw xlsx content is a zip archive, so we must unzip it and check the "[Content_Types].xml" file inside.
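A minimal sketch of that idea, using only the standard library. `ooxmlKind` and `makeZip` are hypothetical helpers, not part of h2non/filetype, and the entry names checked (`word/document.xml`, `xl/workbook.xml`) are the conventional main parts of docx and xlsx packages rather than something this thread confirms.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
)

// ooxmlKind is a hypothetical helper (not part of h2non/filetype).
// It opens buf as a zip archive and inspects the entry names.
// It returns "" when buf is not a readable zip archive.
func ooxmlKind(buf []byte) string {
	zr, err := zip.NewReader(bytes.NewReader(buf), int64(len(buf)))
	if err != nil {
		return ""
	}
	hasContentTypes := false
	kind := ""
	for _, f := range zr.File {
		switch f.Name {
		case "[Content_Types].xml":
			hasContentTypes = true
		case "word/document.xml": // assumed marker for docx
			kind = "docx"
		case "xl/workbook.xml": // assumed marker for xlsx
			kind = "xlsx"
		}
	}
	if !hasContentTypes || kind == "" {
		return "zip" // a plain zip, or a package this sketch does not know
	}
	return kind
}

// makeZip builds an in-memory archive with empty entries, for demonstration only.
func makeZip(names ...string) []byte {
	var b bytes.Buffer
	zw := zip.NewWriter(&b)
	for _, n := range names {
		if _, err := zw.Create(n); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return b.Bytes()
}

func main() {
	fmt.Println(ooxmlKind(makeZip("[Content_Types].xml", "xl/workbook.xml")))
	fmt.Println(ooxmlKind(makeZip("readme.txt")))
}
```

The trade-off is that the zip central directory sits at the end of the file, so this approach needs the whole file in memory, whereas a magic-number check only needs the first bytes.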

The main issue here is that the matchers are stored in a map, whose iteration order is non-deterministic. Even with xlsx support in a matcher, iterating over the map might hit the zip check first (and the file is a zip) and treat that as a successful match, rather than cascading to recognize that something can be both zip and xlsx (or docx, etc.).

I get the problem with the map being unordered, but it should be changed to an array of structs so the order can be defined. The order is important: otherwise ".xlsx", ".docx", etc., as well as "epub" files, will be matched as "application/zip", because zip is a superset of several other types (many types are just zip files with different extensions).
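A rough sketch of the ordered approach, assuming a slice of structs that is walked front to back. The `matcher` type, `matchOrdered`, and `fakeEpubHeader` are made up for illustration and do not reflect the library's internals. The epub check assumes the usual layout where the first archive entry is an uncompressed `mimetype` file placed right after the 30-byte local file header.

```go
package main

import (
	"bytes"
	"fmt"
)

// matcher is a hypothetical type: an extension plus its check function.
type matcher struct {
	ext   string
	match func(buf []byte) bool
}

// isZip checks the "PK\x03\x04" local file header signature.
func isZip(buf []byte) bool {
	return len(buf) > 3 &&
		buf[0] == 0x50 && buf[1] == 0x4B &&
		buf[2] == 0x03 && buf[3] == 0x04
}

// isEpub assumes the first entry is an uncompressed file named "mimetype"
// whose content is "application/epub+zip", starting at offset 30.
func isEpub(buf []byte) bool {
	const marker = "mimetypeapplication/epub+zip" // 28 bytes
	return isZip(buf) && len(buf) >= 58 &&
		bytes.Equal(buf[30:58], []byte(marker))
}

// ordered lists specific types before the generic zip container.
var ordered = []matcher{
	{"epub", isEpub},
	{"zip", isZip},
}

// matchOrdered returns the extension of the first matcher that accepts buf,
// or "" when none does. Slice order makes the result deterministic.
func matchOrdered(buf []byte) string {
	for _, m := range ordered {
		if m.match(buf) {
			return m.ext
		}
	}
	return ""
}

// fakeEpubHeader builds the first 58 bytes of an epub-like file for the demo.
func fakeEpubHeader() []byte {
	buf := []byte{0x50, 0x4B, 0x03, 0x04}
	buf = append(buf, make([]byte, 26)...) // rest of the header, zeroed
	return append(buf, "mimetypeapplication/epub+zip"...)
}

func main() {
	fmt.Println(matchOrdered(fakeEpubHeader()))
	fmt.Println(matchOrdered([]byte("PK\x03\x04")))
}
```

With a map in place of the slice, the zip check could run first on some iterations, which is the behavior described above.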

I was coming here to report an issue with epub files but I saw this equivalent issue. Do you need help implementing this?

I just saw the PR by @ex-nerd. Adding parent types would be a really good solution because it would be a better DX than setting matchers priorities and wouldn't be such a breaking change.
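To illustrate why parent types sidestep the ordering problem, here is a small sketch. It is not the code from the PR mentioned above, which is not shown in this thread. `fileType`, `depth`, and `mostSpecific` are hypothetical names. The idea is that every matching type is collected and the one deepest in the parent chain wins, so map iteration order no longer matters as long as the matching types form a single chain.

```go
package main

import (
	"bytes"
	"fmt"
)

// fileType is a hypothetical type with an optional parent ("" for a root type).
type fileType struct {
	parent string
	match  func(buf []byte) bool
}

func isZip(buf []byte) bool {
	return len(buf) > 3 && bytes.Equal(buf[:4], []byte("PK\x03\x04"))
}

// isEpub assumes an uncompressed "mimetype" entry right after the 30-byte header.
func isEpub(buf []byte) bool {
	const marker = "mimetypeapplication/epub+zip"
	return isZip(buf) && len(buf) >= 58 &&
		bytes.Equal(buf[30:58], []byte(marker))
}

// types is deliberately a map: iteration order is random in Go.
var types = map[string]fileType{
	"zip":  {parent: "", match: isZip},
	"epub": {parent: "zip", match: isEpub},
}

// depth counts how many ancestors a type has.
func depth(ext string) int {
	d := 0
	for t := types[ext]; t.parent != ""; t = types[t.parent] {
		d++
	}
	return d
}

// mostSpecific returns the matching type with the most ancestors, or "".
// Two matching siblings at the same depth would still be ambiguous.
func mostSpecific(buf []byte) string {
	best, bestDepth := "", -1
	for ext, t := range types {
		if !t.match(buf) {
			continue
		}
		if d := depth(ext); d > bestDepth {
			best, bestDepth = ext, d
		}
	}
	return best
}

// epubLikeHeader builds the first 58 bytes of an epub-like file for the demo.
func epubLikeHeader() []byte {
	buf := append([]byte("PK\x03\x04"), make([]byte, 26)...)
	return append(buf, "mimetypeapplication/epub+zip"...)
}

func main() {
	fmt.Println(mostSpecific(epubLikeHeader()))
	fmt.Println(mostSpecific([]byte("PK\x03\x04")))
}
```

A caller that only cares about the container could still walk up the `parent` field to learn that an epub is also a zip.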

@gotoxu That makes sense. I don't know whether the part of the header that contains the Excel MIME type always stays uncompressed. Is that guaranteed?

> I get the problem with the map being unordered but it should be changed to an array of structs so the order could be defined

The order should be less of a concern now in the world of Python 3 (or it could be changed to an `OrderedDict` and still be compatible with legacy Python 2 code while retaining order). However, you still run into problems with "nested" types like zip, where you need to do some additional processing and extract part of the file to determine whether it is a plain file archive, docx, jar, etc.

This is Go, not Python haha

> This is Go, not Python haha

Geesch, no wonder I couldn't find which library I ended up using when I was looking for the reply here (it's been too many years and I forgot which project I was researching these issues for).