jhillyerd / enmime

MIME mail encoding and decoding package for Go

Automatic charset detection is not reliable

dcormier opened this issue

Related to issue #81 (and its PR, #87): the automatic character set detection that was added to resolve that issue is not reliable. I can't share the details right now, but I have an email in the gbk charset that is incorrectly detected as utf-8 (with 100% confidence) and decoded as such, resulting in a mangled mess of bytes.

I'm working on a solution. First, I'm going to check whether chardet also returns the declared charset as a lower-confidence candidate, and go from there.
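To make the idea concrete, here is a minimal sketch of preferring the declared charset when the detector also lists it as a candidate. It is not enmime's or chardet's actual code: the `candidate` type, the `pickCharset` function, and the `minConfidence` threshold are all hypothetical, and the candidate list is hand-written rather than produced by a real detector. Charset alias handling (a detector may report a different name for the same encoding) is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// candidate is a hypothetical stand-in for one detector result:
// a charset name plus a confidence score from 0 to 100.
type candidate struct {
	Charset    string
	Confidence int
}

// pickCharset prefers the declared charset whenever the detector also lists
// it with at least minConfidence, even if another candidate scored higher.
// Otherwise it falls back to the highest-confidence candidate, or to the
// declared charset when there are no candidates at all.
func pickCharset(declared string, candidates []candidate, minConfidence int) string {
	best := candidate{Charset: declared, Confidence: -1}
	for _, c := range candidates {
		if strings.EqualFold(c.Charset, declared) && c.Confidence >= minConfidence {
			return declared
		}
		if c.Confidence > best.Confidence {
			best = c
		}
	}
	return best.Charset
}

func main() {
	// Hand-written example data mirroring the report: utf-8 wins on
	// confidence, but the declared gbk is still among the candidates.
	candidates := []candidate{
		{Charset: "utf-8", Confidence: 100},
		{Charset: "gbk", Confidence: 40},
	}
	fmt.Println(pickCharset("gbk", candidates, 10))
}
```

Whether this helps depends on the open question above: it only changes the outcome if the detector actually reports the declared charset as one of its candidates.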

I'd definitely support some sort of override table for (declared) charsets we know to be detected unreliably.
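One possible shape for such an override table is sketched below. The names `trustDeclared` and `resolveCharset` are hypothetical and not part of enmime, and the single `gbk` entry is only there because it is the case reported in this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// trustDeclared is a hypothetical override table: declared charsets for
// which the detector's answer is ignored. The gbk entry is illustrative.
var trustDeclared = map[string]bool{
	"gbk": true,
}

// resolveCharset returns the declared charset when it is in the override
// table or when detection produced nothing; otherwise it returns the
// detected charset.
func resolveCharset(declared, detected string) string {
	key := strings.ToLower(strings.TrimSpace(declared))
	if trustDeclared[key] || detected == "" {
		return declared
	}
	return detected
}

func main() {
	fmt.Println(resolveCharset("gbk", "utf-8"))        // declared wins via the table
	fmt.Println(resolveCharset("iso-8859-1", "utf-8")) // detection wins
}
```

A plain map keeps the lookup cheap and makes it easy to add further entries as more unreliable cases are confirmed.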

commented

@dcormier this may be related to #131: not having enough input to determine the charset.

Given that #132 is merged, I will close this until we get more data showing we are still failing here.