Automatic charset detection is not reliable
dcormier opened this issue · comments
Related to issue #81 (and its PR, #87): the automatic character set detection that was added to resolve that issue is not reliable. I have an email, which I unfortunately can't share right now, in the gbk charset that is being incorrectly detected and decoded as utf-8 (with 100% confidence), resulting in a mangled mess of bytes.
I'm working on some kind of solution. I'm first going to investigate whether chardet also returns the declared charset as a candidate with lower confidence, and go from there.
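For illustration only, here is a rough Python sketch of that check. It assumes the detector can hand back a ranked list of `(charset, confidence)` candidates; the `pick_charset` and `normalize` names, the `min_confidence` parameter, and the sample confidence values are all made up for the example and are not this project's API.

```python
import codecs


def normalize(name):
    """Return the canonical codec name, or None if the charset is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def pick_charset(declared, candidates, min_confidence=0.0):
    """Prefer the declared charset when the detector also lists it.

    candidates is a hypothetical list of (charset, confidence) tuples,
    standing in for whatever the detector actually returns.
    """
    declared_norm = normalize(declared) if declared else None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for name, confidence in ranked:
        same = declared_norm is not None and normalize(name) == declared_norm
        if same and confidence >= min_confidence:
            return name  # declared charset is a plausible candidate
    # Declared charset was not among the candidates: fall back to the
    # top-ranked guess, or to the declared value if there are no guesses.
    return ranked[0][0] if ranked else declared


# Made-up detector output: utf-8 ranked first, gbk as a weaker candidate.
candidates = [("utf-8", 1.0), ("GBK", 0.4)]
chosen = pick_charset("gbk", candidates)
```

With the sample values above, the declared gbk wins over the higher-ranked utf-8 guess because it still appears in the candidate list. Whether a confidence floor is needed would depend on what real detector output looks like.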
I'd definitely support some sort of override table for (declared) charsets we know to be detected unreliably.
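A minimal sketch of what such an override table could look like, written in Python purely for illustration. The `TRUST_DECLARED` table and `resolve_charset` function are hypothetical names, and gbk is the only entry because it is the only charset reported as failing in this issue.

```python
import codecs

# Hypothetical override table: declared charsets that should be trusted
# over the detector's guess. Only gbk comes from this issue.
TRUST_DECLARED = frozenset({"gbk"})


def resolve_charset(declared, detected):
    """Return the charset to decode with, honoring the override table."""
    if declared:
        try:
            canonical = codecs.lookup(declared).name
        except LookupError:
            canonical = None  # unknown declared charset: ignore it
        if canonical in TRUST_DECLARED:
            return canonical
    return detected


# A gbk-encoded body that the detector (wrongly) reports as utf-8.
raw = "\u4f60\u597d".encode("gbk")
text = raw.decode(resolve_charset("GBK", "utf-8"))
```

The table is keyed on the canonical codec name so that `GBK`, `gbk`, and other known aliases all match the same entry.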
Given #132 is merged, will close until we can get more data showing we are still failing here.