HTML files without doctype detected as plain text, despite extension
chmln opened this issue · comments
Hi @ebassi
Here's a reproduction of the bug.
test.html (incorrect: MIME guess is text/plain despite the .html extension):
<p>test</p>
test_doctype.html (correct: MIME guess is text/html):
<!DOCTYPE html>
asdf
Thanks for your patience; I'll have a look as soon as I can.
If I had to venture a guess, I'd say that test.html, with some XML in it, ends up matching some rule, and thus the extension gets ignored.
In the meantime, you can always start from a pure file-based guess, and use the content-based one only if the guess result is uncertain.
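A minimal sketch of that workaround, in Python for illustration only. It does not use this library's API: `EXTENSION_TYPES`, `guess_from_name`, `guess_from_data`, and `guess_mime` are hypothetical names, the lookup table is a toy, and "uncertain" is modelled simply as the name-based guess returning `None`.

```python
import os

# Hypothetical extension table; a real application would use its MIME database.
EXTENSION_TYPES = {".html": "text/html", ".txt": "text/plain"}


def guess_from_name(filename):
    """Name-based guess: returns a MIME type, or None when the name is inconclusive."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TYPES.get(ext)


def guess_from_data(data):
    """Toy content sniffer (an assumption): only recognises an HTML doctype."""
    if data.lstrip().lower().startswith(b"<!doctype html"):
        return "text/html"
    return "text/plain"


def guess_mime(filename, data):
    """Start from the file name; consult the content only if the name is inconclusive."""
    by_name = guess_from_name(filename)
    if by_name is not None:
        return by_name
    return guess_from_data(data)
```

With this ordering, `guess_mime("test.html", b"<p>test</p>")` is decided by the extension alone, so the missing doctype never comes into play.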
Is there any plan to address this issue, @ebassi?
Not really; as I said: an HTML file is not defined to be some plain text file with XML markup thrown in. If you pass <p>foo</p>, then the extension takes less precedence than some other rule that looks into the file contents.
The appropriate algorithm if you have a file name is:
- check if the extension has a high confidence match
- if you have some data, check if there's a high confidence match
- if the two matches disagree, you will need to figure something out in your code—like presenting a choice of applications to the user
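The steps above could be sketched roughly as follows, in Python for illustration only. `resolve` is a hypothetical helper, not part of this library. It assumes the caller has already run both lookups and passes `None` wherever there was no high-confidence match.

```python
from typing import List, Optional


def resolve(name_match: Optional[str], data_match: Optional[str]) -> List[str]:
    """Combine the high-confidence matches into a list of candidate MIME types.

    An empty list means nothing matched, one entry means the guess is
    unambiguous, and two entries mean the name and the content disagree.
    """
    candidates: List[str] = []
    for mime in (name_match, data_match):
        # Skip missing matches and avoid listing the same type twice.
        if mime is not None and mime not in candidates:
            candidates.append(mime)
    return candidates
```

When two candidates come back, the function deliberately does not pick a winner: that decision is left to the calling code, for example by presenting a choice of applications to the user.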
File names and extensions lie all the time: there's no way to rely on something just because of what it says it is.