ashtuchkin / iconv-lite

Convert character encodings in pure javascript.

Mismatch of unicode to Big5

aowakennomai opened this issue · comments

Hi,
I just found (at least) one Chinese character that is not correctly encoded from Unicode to Big5.
I'm not sure whether any other characters have the same problem.
However, the mapping is correct in https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

0xB05F	0x8D77	# <CJK>

To reproduce the problem:

const iconv = require('iconv-lite');

const char = '起';
const code = char.charCodeAt(0);
console.log(code.toString(16)); // 8d77

const buff = iconv.encode(char, 'big5');
console.log(buff.toString('hex')); // 8ffe ...which is not correct. It should be b05f

My environment is:
nodejs: v12.22.1
npm: v7.9.0
iconv-lite@0.6.2

├─┬ body-parser@1.19.0
│ ├── iconv-lite@0.4.24
│ └─┬ raw-body@2.4.0
│   └── iconv-lite@0.4.24
├─┬ eslint@6.8.0
│ └─┬ inquirer@7.3.3
│   └─┬ external-editor@3.1.0
│     └── iconv-lite@0.4.24
├── iconv-lite@0.6.2
├─┬ mssql@5.1.4
│ └─┬ tedious@4.2.0
│   └── iconv-lite@0.4.24
├─┬ mysql2@1.6.5
│ └── iconv-lite@0.4.24
├─┬ node-fetch@1.7.3
│ └─┬ encoding@0.1.13
│   └── iconv-lite@0.6.2 deduped
└─┬ pdfmake@0.1.71
  └── iconv-lite@0.6.2 deduped

iconv-lite follows the WHATWG Encoding Standard for encoding/decoding popular encodings. For this particular case, see https://encoding.spec.whatwg.org/#big5-encoder

Let's follow the algorithm in the link above. Step 3 is where we get the pointer for the code point in question (0x8D77).
To do that, we follow the definition in https://encoding.spec.whatwg.org/#index-big5-pointer and proceed to the table https://encoding.spec.whatwg.org/index-big5.txt

In that table there are two pointers for code point 0x8D77: 2354 and 7410. According to the definition in https://encoding.spec.whatwg.org/#index-big5-pointer, we should take the first one (see the note to step 2).

So back to the main algorithm: given pointer 2354, we calculate lead = 14 + 0x81 = 0x8F and trail = 156 (trail byte 156 + 0x62 = 0xFE), which results in the Big5 encoding 8ffe.
(I might be wrong somewhere here, so let me know if you find a mistake in this calculation.)
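
As a rough sketch of the arithmetic above, assuming the lead/trail/offset formulas from the WHATWG Big5 encoder section linked earlier: the helper name `big5PointerToBytes` is made up for illustration and is not part of iconv-lite.

```javascript
// Hypothetical helper (not an iconv-lite API): turns a Big5 index pointer
// into its two-byte encoding, following the WHATWG Big5 encoder formulas.
function big5PointerToBytes(pointer) {
  // Each lead byte covers 157 trail positions; lead bytes start at 0x81.
  const lead = Math.floor(pointer / 157) + 0x81;
  const trail = pointer % 157;
  // Trail bytes occupy two ranges: 0x40..0x7E and 0xA1..0xFE.
  const offset = trail < 0x3f ? 0x40 : 0x62;
  return Buffer.from([lead, trail + offset]);
}

console.log(big5PointerToBytes(2354).toString('hex')); // 8ffe
console.log(big5PointerToBytes(7410).toString('hex')); // b05f
```

The two pointers for 0x8D77 therefore account for both byte sequences discussed in this thread: 2354 gives 8ffe and 7410 gives b05f.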

This is arguably an edge case. I think the encoding you expect here (b05f) comes from the second pointer, 7410. I don't have the expertise to judge which value is right, so I'm relying on the WHATWG standard, which mandates using the first pointer in this case, resulting in the 8ffe encoding.

If you're sure b05f is more correct, you could create a ticket here https://github.com/whatwg/encoding/issues and I presume actual experts can check this and make corrections to the standard. When that happens, I'll be happy to update iconv-lite as well. As a side effect this'll make all the browsers fix this too, yay!

Hello @ashtuchkin ,

Thanks for your reply.
Following your pointer, I went to WHATWG's issues and found that a similar issue seems to have been discussed there already:
whatwg/encoding#9

If I understand correctly, according to that discussion, shouldn't step 1 of https://encoding.spec.whatwg.org/#index-big5-pointer be applied first,
to exclude pointer 2354, because it is less than (0xA1 - 0x81) × 157 (= 5024)?
Step 2 is then a no-op, because only one pointer (7410) remains.
Finally, 7410 is returned as the index pointer.
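
The selection described above can be sketched like this. It is only an illustration under stated assumptions: `indexRows` is a hypothetical two-row excerpt holding just the pointers for 0x8D77 mentioned in this thread, `big5IndexPointer` is a made-up name, and the spec's special handling of a few code points that take the last pointer is left out.

```javascript
// Hypothetical excerpt of index-big5.txt as [pointer, codePoint] pairs:
// only the two rows for U+8D77 quoted in this thread.
const indexRows = [
  [2354, 0x8d77],
  [7410, 0x8d77],
];

// Pointers below this threshold are excluded when encoding.
const MIN_POINTER = (0xa1 - 0x81) * 157; // 5024

function big5IndexPointer(codePoint, rows) {
  // Step 1: exclude entries whose pointer is less than the threshold.
  const candidates = rows.filter(([pointer]) => pointer >= MIN_POINTER);
  // Then take the first remaining pointer for the code point, if any.
  const hit = candidates.find(([, cp]) => cp === codePoint);
  return hit ? hit[0] : null;
}

console.log(big5IndexPointer(0x8d77, indexRows)); // 7410
```

Without the filter in step 1, the first match would be 2354, which is the pointer that produced 8ffe.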

That's a good point, thank you! It seems that I've missed that condition. I'll look closer and see how I can fix it.

Just released a v0.6.3 with the fix. Let me know if you see any problems. Thanks for your contribution!

Hi, ashtuchkin:
After updating to the latest version, it works well now.
Thank you for fixing this so rapidly!