Matches docx/xlsx as an `application/zip`

Question

DarthPestilane opened this issue 6 years ago · comments

ft, _ := filetype.Match(buf)

`buf` is read from a docx or xlsx file, but `ft.MIME.Value` comes back as `application/zip`.

sonicWhale · Answer 1 · Thu Jun 21 2018 20:33:43 GMT+0800 (China Standard Time)

Hi all!
I see the same thing when uploading an .xlsx file:

...
	file, _, _ := r.FormFile("uploadFile")
	defer file.Close()

	fileBytes, _ := ioutil.ReadAll(file)

	k, _ := filetype.Match(fileBytes)
	log.Println(k.Extension, k.MIME)
...

Result
zip {application zip application/zip}

Xu Qiaolun · Answer 2 · Mon Aug 27 2018 12:43:53 GMT+0800 (China Standard Time)

The xlsx matcher looks like this:

func Xlsx(buf []byte) bool {
	return len(buf) > 3 &&
		buf[0] == 0x50 && buf[1] == 0x4B &&
		buf[2] == 0x03 && buf[3] == 0x04 &&
		bytes.Contains(buf[:256], []byte(TypeXlsx.MIME.Value))
}

But an xlsx file is itself a zip archive, so to identify it reliably we have to unzip the raw content and check the "[Content_Types].xml" file inside it.

Chris Petersen · Answer 3 · Fri Jul 26 2019 07:04:53 GMT+0800 (China Standard Time)

The main issue here is that the matchers are stored in a map, whose iteration order is non-deterministic. Even with xlsx support in a matcher, iterating over the map might encounter the zip check first (which the file does satisfy) and treat that as a successful match, rather than cascading to recognize that something can be both zip and xlsx (or docx, etc.).

Luis Durão · Answer 4 · Sat Oct 03 2020 02:28:33 GMT+0800 (China Standard Time)

I get the problem with the map being unordered, but it should be changed to an array of structs so the order could be defined. The order is important; otherwise ".xlsx", ".docx", etc., as well as "epub" files will be matched as "application/zip", because that is a superset of several other types (many types are just zip files with different extensions).

I was coming here to report an issue with epub files but I saw this equivalent issue. Do you need help implementing this?

Luis Durão · Answer 5 · Sat Oct 03 2020 02:30:57 GMT+0800 (China Standard Time)

I just saw the PR by @ex-nerd. Adding parent types would be a really good solution because it would be a better DX than setting matcher priorities and wouldn't be such a breaking change.
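For readers unfamiliar with the idea, one way a parent-type relationship could look is sketched below. This is purely illustrative and makes no claim about how the PR implements it; the `kind` struct, its `Parent` field, and the `Is` method are all hypothetical names.

```go
package main

import "fmt"

// kind is a hypothetical type descriptor. Parent points at the container
// format this type is built on, or is nil for a root type.
type kind struct {
	Ext    string
	MIME   string
	Parent *kind
}

// Is reports whether k is target itself or descends from it.
func (k *kind) Is(target *kind) bool {
	for cur := k; cur != nil; cur = cur.Parent {
		if cur == target {
			return true
		}
	}
	return false
}

var (
	zipKind  = &kind{Ext: "zip", MIME: "application/zip"}
	xlsxKind = &kind{
		Ext:    "xlsx",
		MIME:   "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
		Parent: zipKind,
	}
)

func main() {
	// An xlsx file is also a zip, but a plain zip is not an xlsx.
	fmt.Println(xlsxKind.Is(zipKind), zipKind.Is(xlsxKind))
}
```

A caller that only cares whether a file is some kind of zip could then ask `Is(zipKind)` regardless of which specific child type was detected.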

Luis Durão · Answer 6 · Sat Oct 03 2020 02:35:39 GMT+0800 (China Standard Time)

@gotoxu That makes sense. I don't know if the part of the header that contains the Excel MIME type would always stay uncompressed. Is that guaranteed?

Chris Petersen · Answer 7 · Sat Oct 03 2020 03:13:45 GMT+0800 (China Standard Time)

> I get the problem with the map being unordered but it should be changed to an array of structs so the order could be defined

The order should be less of a concern now in the world of Python 3 (or it could be changed to an OrderedDict and still be compatible with legacy Python 2 code while retaining order). However, you still run into problems with "nested" types like zip, where you need to do some additional processing and extract part of the file to determine whether it is a plain file archive, docx, jar, etc.

Luis Durão · Answer 8 · Sat Oct 03 2020 05:14:48 GMT+0800 (China Standard Time)

This is Go, not Python haha

Chris Petersen · Answer 9 · Sat Oct 03 2020 05:24:49 GMT+0800 (China Standard Time)

> This is Go, not Python haha

Geesch, no wonder I couldn't find which library I ended up using when I was looking for the reply here (it's been too many years and I forgot which project I was researching these issues for).