Automatic charset detection is not reliable

Question

dcormier opened this issue 6 years ago

Related to issue #81 (and its PR, #87): the automatic character set detection that was added to resolve that issue is not reliable. I have an email in the gbk charset (which I unfortunately can't share right now) that is being incorrectly detected and decoded as utf-8 with 100% confidence, resulting in a mangled mess of bytes.

I'm working on a solution. I'm first going to investigate whether chardet also returns the declared charset as a lower-confidence candidate, and go from there.
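
To make the idea concrete, here is a minimal sketch of preferring the declared charset whenever the detector also lists it as a candidate. This is not the library's code: `Candidate`, `pickCharset`, and `minConfidence` are hypothetical names, and `Candidate` only stands in for whatever the detector actually returns (a charset name plus a 0-100 confidence is assumed).

```go
package main

import "strings"

// Candidate is a hypothetical stand-in for one detector result:
// a charset name plus a confidence score assumed to range from 0 to 100.
type Candidate struct {
	Charset    string
	Confidence int
}

// pickCharset returns the declared charset when the detector also reports it
// with at least minConfidence. Otherwise it returns the highest-confidence
// candidate, or the declared charset unchanged if there are no candidates.
func pickCharset(declared string, candidates []Candidate, minConfidence int) string {
	best := Candidate{Charset: declared, Confidence: -1}
	for _, c := range candidates {
		// Case-insensitive match only; alias handling is left out of this sketch.
		if declared != "" && strings.EqualFold(c.Charset, declared) && c.Confidence >= minConfidence {
			return declared
		}
		if c.Confidence > best.Confidence {
			best = c
		}
	}
	return best.Charset
}
```

With candidates `{"UTF-8", 100}` and `{"GBK", 40}`, a declared charset of `gbk`, and a threshold of 20, this returns `gbk` rather than the top-ranked `UTF-8`. A real version would also need to map between naming schemes, since a detector may report the same family under a different label than the one in the message header.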

James Hillyerd · Answer 1 · Sun Dec 09 2018 03:24:17 GMT+0800 (China Standard Time)

I'd definitely support some sort of override table for (declared) charsets we know to be detected unreliably.
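
One possible shape for such an override table, as a rough sketch only: the table contents, `trustDeclared`, and `resolveCharset` are hypothetical, and gbk is listed purely because it is the case reported above.

```go
package main

import "strings"

// trustDeclared is a hypothetical override table: declared charsets for which
// detection is known to be unreliable, so the declaration wins.
var trustDeclared = map[string]bool{
	"gbk": true,
}

// resolveCharset returns the normalized declared charset if it appears in the
// override table, and the detector's answer otherwise.
func resolveCharset(declared, detected string) string {
	key := strings.ToLower(strings.TrimSpace(declared))
	if trustDeclared[key] {
		return key
	}
	return detected
}
```

Here `resolveCharset("GBK", "utf-8")` yields `gbk`, while a declared charset not in the table falls through to the detected value.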

Neil · Answer 2 · Fri Aug 30 2019 05:50:42 GMT+0800 (China Standard Time)

@dcormier this may be related to #131: the detector may not have had enough input to determine the charset.

James Hillyerd · Answer 3 · Mon Jan 20 2020 06:24:15 GMT+0800 (China Standard Time)

Given that #132 is merged, I'll close this until we get more data showing that detection is still failing here.