jhillyerd / enmime

MIME mail encoding and decoding package for Go

Declared charset cannot always be trusted

nerdlich opened this issue · comments

Unfortunately, you cannot always trust the charset declared in the Content-Type header. We've seen cases where enmime essentially corrupts part.Content by decoding with a declared gbk charset when the data was in fact utf-8.

Instead of taking for granted whatever is declared in the header, one could consult a charset detection library such as https://github.com/saintfish/chardet. I'll see if I can prepare a PR. Let me know what you think.
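Before reaching for a full detection library, the specific failure mode described above (bytes that are really UTF-8 but declared as a legacy charset) can be caught with a cheap stdlib check. This is only a sketch; `trustDeclaredCharset` is a hypothetical helper, not part of enmime or chardet:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// trustDeclaredCharset is a hypothetical helper: it returns false when the
// declared charset is a legacy encoding but the bytes are valid UTF-8
// containing multi-byte sequences -- the mismatch described in this issue.
func trustDeclaredCharset(declared string, data []byte) bool {
	if declared == "utf-8" || declared == "us-ascii" {
		return true
	}
	// Valid UTF-8 with at least one multi-byte rune (byte count exceeds
	// rune count) is a strong hint the declared legacy charset is wrong.
	if utf8.Valid(data) && len(data) > utf8.RuneCount(data) {
		return false
	}
	return true
}

func main() {
	utf8Body := []byte("你好，世界") // UTF-8 bytes, but the header claims gbk
	fmt.Println(trustDeclaredCharset("gbk", utf8Body))          // false: don't decode as gbk
	fmt.Println(trustDeclaredCharset("gbk", []byte("hi there"))) // true: ascii is fine either way
}
```

Note this heuristic only covers the UTF-8-mislabeled case; anything beyond that (e.g. distinguishing GBK from Big5) still needs statistical detection like chardet's.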

Sounds like a good idea to me. I do wonder, how would we decide which to trust?

The results of this library have a "confidence" field. No idea how that's calculated though ;)

To be really sure, we could do a full decode/encode cycle with the detected charset and compare the input and output bytes. But that might be too much overhead.
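The round-trip idea is easy to sketch for UTF-8, where Go's built-in string conversion does the decoding and invalid bytes become U+FFFD replacement runes, so re-encoding no longer matches the input. A real implementation would plug in golang.org/x/text/encoding decoders for gbk, gb18030, and so on; this stdlib-only version is just to illustrate the mechanism:

```go
package main

import "fmt"

// roundTripsAsUTF8 reports whether data survives a decode/encode cycle in
// UTF-8 unchanged. Invalid input bytes decode to U+FFFD, which re-encodes
// to a different byte sequence, so the comparison fails exactly when the
// input was not well-formed UTF-8.
func roundTripsAsUTF8(data []byte) bool {
	return string([]rune(string(data))) == string(data)
}

func main() {
	fmt.Println(roundTripsAsUTF8([]byte("héllo")))    // true: valid UTF-8
	fmt.Println(roundTripsAsUTF8([]byte{0xC3, 0x28})) // false: 0xC3 lacks a continuation byte
}
```

One caveat: for single-byte charsets like ISO-8859-1 every byte sequence round-trips cleanly, so this test alone cannot reject them; it is most useful for multi-byte encodings.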

Confidence scoring sounds like it should work for our use case.

I agree that doing multiple decode/encode cycles all the time in enmime would be too expensive. What may make more sense is to build an external tool that runs that test over a large corpus of email (if you have one; I need to find one...), lets us input a confidence-score threshold, and then reports the failure rate at that threshold.
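The tallying step of such a corpus tool could look roughly like this. Everything here is illustrative: the `detection` record and `failureRate` function are assumptions for the sketch, not enmime or chardet API, and `detectorCorrect` stands in for whatever ground truth (manual labeling or a round-trip test) the corpus run produces:

```go
package main

import "fmt"

// detection is one record from a corpus run: what the header declared, what
// the detector guessed, its confidence, and whether the detector was judged
// correct. All field names are hypothetical.
type detection struct {
	declared, detected string
	confidence         int
	detectorCorrect    bool
}

// failureRate returns the fraction of overrides (detector disagreeing with
// the declared charset at or above the threshold) that were wrong.
func failureRate(corpus []detection, threshold int) float64 {
	overrides, wrong := 0, 0
	for _, d := range corpus {
		if d.confidence >= threshold && d.detected != d.declared {
			overrides++
			if !d.detectorCorrect {
				wrong++
			}
		}
	}
	if overrides == 0 {
		return 0
	}
	return float64(wrong) / float64(overrides)
}

func main() {
	// Tiny fabricated sample, mirroring the kinds of cases discussed here.
	corpus := []detection{
		{"gbk", "utf-8", 95, true},
		{"iso-8859-1", "utf-8", 60, false},
		{"gb2312", "gb18030", 92, true},
	}
	for _, t := range []int{50, 90} {
		fmt.Printf("threshold %d: failure rate %.2f\n", t, failureRate(corpus, t))
	}
}
```

Sweeping the threshold over a real corpus would show where the trade-off between missed fixes and wrong overrides sits.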

I've tried the Enron data set (https://www.cs.cmu.edu/~enron/), but not only is the data very old, it also doesn't contain any interesting charsets at all, mainly us-ascii, utf-8, iso-8859-1.

I might be able to gather a body of emails but only for a private test, not to share publicly.

Did some more tests now with a body of ~3GB of emails received in China, Japan, Korea, Pakistan, UAE, Russia, Bulgaria, and Greece, so lots of potentially problematic charsets are involved. At lower confidence scores you get a lot of false positives and boring stuff (e.g. UTF-8 instead of ISO-8859-1), but with confidence >90 the results look very promising. In the majority of cases GBK or GB2312 is detected as GB18030, which, as I understand it, is a superset of both, so it's usually not a real mismatch, and in any case not wrong to decode as GB18030. But there are also a few true positives, like GB2312 correctly detected as UTF-8, or us-ascii correctly detected as ISO-2022-JP.

To leave all options open for users of enmime, we could also keep both "OrigContent" and "OrigCharset" (as declared) and "Content" and "Charset" (as detected) in the Part struct.
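The proposal above might look like this in the Part struct. This is a sketch of the suggestion, not the current enmime API; the `Orig*` field names come straight from the comment and are hypothetical:

```go
package main

import "fmt"

// Part sketches keeping both the declared and the detected charset, per the
// suggestion above. OrigContent/OrigCharset are hypothetical fields.
type Part struct {
	Content     []byte // content decoded using the detected charset
	Charset     string // charset actually used for decoding (as detected)
	OrigContent []byte // raw content bytes as received (hypothetical)
	OrigCharset string // charset declared in Content-Type (hypothetical)
}

func main() {
	raw := []byte("你好") // actually UTF-8 on the wire, though the header said gbk
	part := Part{
		Content:     raw, // already UTF-8, so decoding as detected utf-8 is a no-op
		Charset:     "utf-8",
		OrigContent: raw,
		OrigCharset: "gbk",
	}
	fmt.Println(part.Charset, part.OrigCharset) // utf-8 gbk
}
```

Keeping the original bytes means a client that disagrees with the detection can always redo the decode itself.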

I've opened #90 to track giving clients some way to configure enmime; until then, let's see how the hard-coded confidence threshold works out for folks.

Related to email corpus testing, I opened #92, as I think it would be nice to have a standard tool to automate it.