jhillyerd / enmime

MIME mail encoding and decoding package for Go

Declared charset cannot always be trusted

nerdlich opened this issue · comments

Unfortunately, you cannot always trust the charset declared in the Content-Type header. We've seen cases where enmime essentially corrupts part.Content by decoding with a declared gbk charset when the data was in fact utf-8.

Instead of taking for granted whatever is declared in the header, one could consult a charset detection library such as https://github.com/saintfish/chardet. I'll see if I can prepare a PR. Let me know what you think.
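Before reaching for a full detection library, the specific failure mode described above (bytes that are really UTF-8 but declared as a legacy charset) can be caught with a cheap stdlib check. This is only a sketch; `trustDeclaredCharset` is a hypothetical helper, not part of enmime or chardet:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// trustDeclaredCharset is a hypothetical helper: it returns false when the
// declared charset is a legacy encoding but the bytes are valid UTF-8
// containing multi-byte sequences -- the mismatch described in this issue.
func trustDeclaredCharset(declared string, data []byte) bool {
	if declared == "utf-8" || declared == "us-ascii" {
		return true
	}
	// Valid UTF-8 with at least one multi-byte rune (byte count exceeds
	// rune count) is a strong hint the declared legacy charset is wrong.
	if utf8.Valid(data) && len(data) > utf8.RuneCount(data) {
		return false
	}
	return true
}

func main() {
	utf8Body := []byte("你好，世界") // UTF-8 bytes, but the header claims gbk
	fmt.Println(trustDeclaredCharset("gbk", utf8Body))          // false: don't decode as gbk
	fmt.Println(trustDeclaredCharset("gbk", []byte("hi there"))) // true: ascii is fine either way
}
```

Note this heuristic only covers the UTF-8-mislabeled case; anything beyond that (e.g. distinguishing GBK from Big5) still needs statistical detection like chardet's.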

Sounds like a good idea to me. I do wonder, how would we decide which to trust?

The results of this library have a "confidence" field. No idea how that's calculated though ;)

To be really sure, we could do a full decode/encode cycle with the detected charset and compare the input and output bytes. But that might be too much overhead.
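The round-trip idea is easy to sketch for UTF-8, where Go's built-in string conversion does the decoding and invalid bytes become U+FFFD replacement runes, so re-encoding no longer matches the input. A real implementation would plug in golang.org/x/text/encoding decoders for gbk, gb18030, and so on; this stdlib-only version is just to illustrate the mechanism:

```go
package main

import "fmt"

// roundTripsAsUTF8 reports whether data survives a decode/encode cycle in
// UTF-8 unchanged. Invalid input bytes decode to U+FFFD, which re-encodes
// to a different byte sequence, so the comparison fails exactly when the
// input was not well-formed UTF-8.
func roundTripsAsUTF8(data []byte) bool {
	return string([]rune(string(data))) == string(data)
}

func main() {
	fmt.Println(roundTripsAsUTF8([]byte("héllo")))    // true: valid UTF-8
	fmt.Println(roundTripsAsUTF8([]byte{0xC3, 0x28})) // false: 0xC3 lacks a continuation byte
}
```

One caveat: for single-byte charsets like ISO-8859-1 every byte sequence round-trips cleanly, so this test alone cannot reject them; it is most useful for multi-byte encodings.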

Confidence scoring sounds like it should work for our use case.

I agree that doing multiple decode/encode cycles all the time in enmime would be too expensive. What may make more sense is to build an external tool that runs that test over a large corpus of email (if you have one; I need to find one...), lets us input a confidence-score threshold, and then reports the failure rate at that threshold.
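The tallying step of such a corpus tool could look roughly like this. Everything here is illustrative: the `detection` record and `failureRate` function are assumptions for the sketch, not enmime or chardet API, and `detectorCorrect` stands in for whatever ground truth (manual labeling or a round-trip test) the corpus run produces:

```go
package main

import "fmt"

// detection is one record from a corpus run: what the header declared, what
// the detector guessed, its confidence, and whether the detector was judged
// correct. All field names are hypothetical.
type detection struct {
	declared, detected string
	confidence         int
	detectorCorrect    bool
}

// failureRate returns the fraction of overrides (detector disagreeing with
// the declared charset at or above the threshold) that were wrong.
func failureRate(corpus []detection, threshold int) float64 {
	overrides, wrong := 0, 0
	for _, d := range corpus {
		if d.confidence >= threshold && d.detected != d.declared {
			overrides++
			if !d.detectorCorrect {
				wrong++
			}
		}
	}
	if overrides == 0 {
		return 0
	}
	return float64(wrong) / float64(overrides)
}

func main() {
	// Tiny fabricated sample, mirroring the kinds of cases discussed here.
	corpus := []detection{
		{"gbk", "utf-8", 95, true},
		{"iso-8859-1", "utf-8", 60, false},
		{"gb2312", "gb18030", 92, true},
	}
	for _, t := range []int{50, 90} {
		fmt.Printf("threshold %d: failure rate %.2f\n", t, failureRate(corpus, t))
	}
}
```

Sweeping the threshold over a real corpus would show where the trade-off between missed fixes and wrong overrides sits.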

I've tried the Enron data set (https://www.cs.cmu.edu/~enron/), but not only is the data very old, it also doesn't contain any interesting charsets at all, mainly us-ascii, utf-8, iso-8859-1.

I might be able to gather a body of emails but only for a private test, not to share publicly.

Did some more tests now with a body of ~3GB of emails received in China, Japan, Korea, Pakistan, UAE, Russia, Bulgaria, and Greece, so lots of potentially problematic charsets are involved. At lower confidence scores you get a lot of false positives and boring stuff (e.g. UTF-8 instead of ISO-8859-1), but with confidence >90 the results look very promising. In the majority of cases GBK or GB2312 is detected as GB18030, which, as I understand it, is a superset of both, so it's usually not a real mismatch, and in any case not wrong to decode as GB18030. But there are also a few true positives, like GB2312 correctly detected as UTF-8, or us-ascii correctly detected as ISO-2022-JP.

To leave all options open for users of enmime, we could also keep both "OrigContent" and "OrigCharset" (as declared) and "Content" and "Charset" (as detected) in the Part struct.
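The proposal above might look like this in the Part struct. This is a sketch of the suggestion, not the current enmime API; the `Orig*` field names come straight from the comment and are hypothetical:

```go
package main

import "fmt"

// Part sketches keeping both the declared and the detected charset, per the
// suggestion above. OrigContent/OrigCharset are hypothetical fields.
type Part struct {
	Content     []byte // content decoded using the detected charset
	Charset     string // charset actually used for decoding (as detected)
	OrigContent []byte // raw content bytes as received (hypothetical)
	OrigCharset string // charset declared in Content-Type (hypothetical)
}

func main() {
	raw := []byte("你好") // actually UTF-8 on the wire, though the header said gbk
	part := Part{
		Content:     raw, // already UTF-8, so decoding as detected utf-8 is a no-op
		Charset:     "utf-8",
		OrigContent: raw,
		OrigCharset: "gbk",
	}
	fmt.Println(part.Charset, part.OrigCharset) // utf-8 gbk
}
```

Keeping the original bytes means a client that disagrees with the detection can always redo the decode itself.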

I've opened #90 to track giving clients some way to configure enmime; until then, let's see how the hard-coded confidence threshold works out for folks.

Related to email corpus testing, I opened #92, as I think it would be nice to have a standard tool to automate it.