If the subject is in Chinese, garbled text appears after parsing

Question

xigirl opened this issue a year ago · comments

If the Subject header in the raw email source is in Chinese, the parsed value may come out garbled, for example as 锟斤拷. The headers look like this:

Subject:发信方已撤回邮件：测试测试
X-QQ-mid: tyyjxt-xx11d002-yh15wt16855880
Date:Thu, 1 Jun 2023 10:53:30 +0800
Content-Type: multipart/mixed;
boundary="----=_NextPart_45518C5C_082708D8_5A03221D"

Andris Reinman · Answer 1 · Thu Jul 20 2023 04:04:47 GMT+0800 (China Standard Time)

The encoding of your input is probably wrong. Are you using UTF-16 or something else? If the message is parsed as a Unicode string, everything works fine:

const mail = `Subject:发信方已撤回邮件：测试测试
X-QQ-mid: tyyjxt-xx11d002-yh15wt16855880
Date:Thu, 1 Jun 2023 10:53:30 +0800
Content-Type: multipart/mixed;
boundary="----=_NextPart_45518C5C_082708D8_5A03221D"`;

const simpleParser = require('mailparser').simpleParser;
simpleParser(mail, (err, data) => {
    console.log(data.subject);
});

// prints as expected:
// 发信方已撤回邮件：测试测试

xigirl · Answer 2 · Thu Jul 27 2023 15:31:30 GMT+0800 (China Standard Time)

I'm not sure what encoding the sender uses. The subject in the email source I fetched over POP3 is in Chinese characters. Through debugging, I found that after these two steps in the processHeaders function the value becomes garbled and cannot be recovered:

let value = ((this.libmime.decodeHeader(line.line) || {}).value || '').toString().trim();
value = Buffer.from(value, 'binary').toString();

Andris Reinman · Answer 3 · Thu Jul 27 2023 15:47:45 GMT+0800 (China Standard Time)

Mailparser defaults the charset to utf8 for header values if no encoding is specified, so make sure your strings are either regular Unicode strings or use Buffer values with utf8 bytes. If the charset encoding is something else, then the parsing fails.
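To illustrate why a mismatched charset cannot be undone after the fact, here is a minimal sketch using only Node built-ins. It assumes the sender used GBK, which the thread never confirms, and it does not call Mailparser itself; it only shows what happens when GBK bytes are decoded as UTF-8. The GBK decoding step relies on a Node build with full ICU support.

```javascript
// ASSUMPTION: the sender encoded the subject as GBK (not confirmed in this thread).
// The bytes B2 E2 CA D4 are the GBK encoding of 测试.
const gbkBytes = Buffer.from([0xb2, 0xe2, 0xca, 0xd4]);

// Decoding those bytes as UTF-8 replaces each invalid sequence with U+FFFD.
const misread = gbkBytes.toString('utf8');
console.log(misread.includes('\uFFFD'));

// The original bytes are lost: re-encoding yields EF BF BD sequences instead.
const reencoded = Buffer.from(misread, 'utf8');
console.log(reencoded.equals(gbkBytes));

// Viewing the UTF-8 bytes of two U+FFFD characters as GBK produces the
// familiar garbled text reported in this issue.
const mojibake = new TextDecoder('gbk').decode(Buffer.from('\uFFFD\uFFFD', 'utf8'));
console.log(mojibake);
```

Once the replacement characters are in the string, the source bytes are gone, so the conversion has to happen before parsing rather than after.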

If you know that the file uses a charset that is not standard, then use a converter module like iconv-lite to convert your input to Unicode before passing this to Mailparser.
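A rough sketch of that conversion step follows. To stay dependency-free it uses Node's built-in TextDecoder in place of iconv-lite, and it assumes the source charset is GBK. `decodeRawMessage` is a hypothetical helper name, not part of Mailparser, and decoding GBK this way requires a Node build with full ICU support.

```javascript
// Hypothetical helper: decode raw message bytes from a known legacy charset
// into a regular Unicode string. Not part of mailparser.
function decodeRawMessage(rawBytes, charset) {
    // fatal: true makes the decoder throw on bytes that are invalid in the
    // given charset instead of silently inserting U+FFFD.
    return new TextDecoder(charset, { fatal: true }).decode(rawBytes);
}

// ASSUMPTION: the message is GBK encoded. This builds "Subject:测试" followed
// by a blank line, with 测试 stored as the GBK bytes B2 E2 CA D4.
const gbkMessage = Buffer.concat([
    Buffer.from('Subject:', 'ascii'),
    Buffer.from([0xb2, 0xe2, 0xca, 0xd4]),
    Buffer.from('\r\n\r\n', 'ascii')
]);

const unicodeMessage = decodeRawMessage(gbkMessage, 'gbk');
console.log(JSON.stringify(unicodeMessage));
```

The resulting string can then be passed to simpleParser in the same way as the Unicode string inputs shown in this thread.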

Example 1 Input as a regular Unicode string

const mail = `Subject:发信方已撤回邮件：测试测试
X-QQ-mid: tyyjxt-xx11d002-yh15wt16855880
Date:Thu, 1 Jun 2023 10:53:30 +0800
Content-Type: multipart/mixed;
boundary="----=_NextPart_45518C5C_082708D8_5A03221D"`;

const simpleParser = require('mailparser').simpleParser;
simpleParser(mail, (err, data) => {
    console.log(data.subject);
});
// output: 发信方已撤回邮件：测试测试

Example 2 Input as a Buffer of a UTF-8 encoded string

// input as a Buffer of UTF8 encoded string
const mail = Buffer.from(`Subject:发信方已撤回邮件：测试测试
X-QQ-mid: tyyjxt-xx11d002-yh15wt16855880
Date:Thu, 1 Jun 2023 10:53:30 +0800
Content-Type: multipart/mixed;
boundary="----=_NextPart_45518C5C_082708D8_5A03221D"`, 'utf8');

const simpleParser = require('mailparser').simpleParser;
simpleParser(mail, (err, data) => {
    console.log(data.subject);
});
// output: 发信方已撤回邮件：测试测试

For any input type other than Unicode strings or UTF-8 encoded buffers, you need to convert the file from its source encoding to UTF-8.