HTML files without doctype detected as plain text, despite extension

Question

chmln opened this issue 4 years ago

Here's a reproduction of the bug.

test.html (incorrect: mime guess is text/plain despite html extension)

<p>test</p>

test_doctype.html (correct: mime guess is text/html)

<!DOCTYPE html>
asdf

Emmanuele Bassi · Answer 1 · Tue Jul 07 2020 20:27:39 GMT+0800 (China Standard Time)

Thanks for your patience; I'll have a look as soon as I can.

If I had to venture a guess, I'd say that test.html, with some XML in it, ends up matching some rule, and thus the extension gets ignored.

In the meantime, you can always start from a pure file name-based guess, and use the content-based one only if the result of that guess is uncertain.
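
A minimal sketch of that workaround, not the library's own API: Python's standard `mimetypes` module stands in for the file name lookup, and `sniff_content` is a hypothetical helper that only recognises a couple of unambiguous signatures. "Uncertain" is assumed here to mean that the file name lookup returned nothing.

```python
import mimetypes
from typing import Optional


def sniff_content(data: bytes) -> Optional[str]:
    """Hypothetical content sniffer: return a MIME type only for clear signatures."""
    head = data[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    if head.startswith(b"%pdf-"):
        return "application/pdf"
    return None  # no confident match, e.g. a bare <p>test</p>


def guess_mime(filename: str, data: bytes = b"") -> str:
    """Trust the file name first; look at the contents only if the name says nothing."""
    by_name, _encoding = mimetypes.guess_type(filename)
    if by_name is not None:
        return by_name
    # Assumption: an unknown extension is what counts as an "uncertain" guess.
    return sniff_content(data) or "application/octet-stream"
```

With this ordering, `guess_mime("test.html", b"<p>test</p>")` is decided by the extension alone, so the missing doctype no longer matters.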

tpeacock19 · Answer 2 · Tue Jul 06 2021 11:16:58 GMT+0800 (China Standard Time)

Is there any plan to address this issue, @ebassi?

Emmanuele Bassi · Answer 3 · Thu Feb 02 2023 15:15:59 GMT+0800 (China Standard Time)

Not really. As I said, an HTML file is not defined as a plain text file with some XML markup thrown in. If you pass <p>foo</p>, the extension takes lower precedence than other rules that look into the file contents.

The appropriate algorithm if you have a file name is:

1. Check if the extension has a high-confidence match.
2. If you have some data, check if there's a high-confidence match.
3. If the two matches disagree, you will need to figure something out in your code, like presenting a choice of applications to the user.
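
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `mimetypes.guess_type` from the Python standard library is treated as the high-confidence extension match, while `match_magic`, `Resolution`, and `resolve` are hypothetical names introduced for this example.

```python
import mimetypes
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resolution:
    mime: Optional[str]    # the agreed type, or None when the caller must decide
    candidates: List[str]  # every type that had a confident match


def match_magic(data: bytes) -> Optional[str]:
    """Hypothetical data matcher: only a few unambiguous signatures count."""
    head = data[:64].lstrip().lower()
    if head.startswith(b"<!doctype html"):
        return "text/html"
    if head.startswith(b"%pdf-"):
        return "application/pdf"
    return None


def resolve(filename: str, data: bytes = b"") -> Resolution:
    by_name, _encoding = mimetypes.guess_type(filename)  # step 1: extension
    by_data = match_magic(data) if data else None        # step 2: contents
    if by_name and by_data and by_name != by_data:
        # Step 3: the two matches disagree, so hand both back to the caller,
        # which might present a choice of applications to the user.
        return Resolution(None, [by_name, by_data])
    chosen = by_name or by_data
    return Resolution(chosen, [chosen] if chosen else [])
```

For the files in this report, `resolve("test.html", b"<p>test</p>")` settles on `text/html` because the contents produce no confident match. A file named `report.html` that starts with `%PDF-` comes back with `mime=None` and both candidates.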

File names and extensions lie all the time: there's no way to rely on something just because of what it says it is.