Still using charCodeAt? Then you're out of date

Question

akira-cn opened this issue 5 years ago · comments

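When working with Chinese and other Unicode characters in JavaScript, we rely on the language's Unicode-related APIs.

In the early days, JavaScript offered String.prototype.charCodeAt and String.fromCharCode, which convert a string to its UTF-16 code units and UTF-16 code units back to a string.

For example: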

const str = '中文';

console.log([...str].map(char => char.charCodeAt(0)));
// [20013, 25991]

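Here we spread the string into individual characters and call charCodeAt on each one to get its Unicode value. 20013 and 25991 are the code points of the two characters 中 and 文; both fit in 16 bits, so each code point equals its single UTF-16 code unit.

Likewise, String.fromCharCode turns those values back into a string: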

const charCodes = [20013, 25991];

console.log(String.fromCharCode(...charCodes)); // 中文

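Most readers will know these two methods well; they have been supported since at least ES3. They are no longer enough, though, for handling Unicode text today.

Why? Consider this example: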

const str = '🀄';

console.log(str.charCodeAt(0)); // 55356

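This character is the familiar Red Dragon (红中) mahjong tile, which many input methods can now type directly. The result looks plausible at first glance.

But try the reverse: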

console.log(String.fromCharCode(55356)); // �

In fact 55356 is not the whole UTF-16 encoding of 🀄. It is only the first of two code units, so reading the full encoding with charCodeAt takes two calls:

const str = '🀄';

console.log(str.charCodeAt(0), str.charCodeAt(1)); // 55356 56324

Accordingly, only String.fromCharCode(55356, 56324) restores the 🀄 character.
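To see which code units a string really contains, it can help to dump them in hexadecimal. The helper below is a minimal sketch; the name toCodeUnits is made up for this illustration and is not a built-in.

```javascript
// Hypothetical helper: list every UTF-16 code unit of a string in hex.
function toCodeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0'));
  }
  return units;
}

console.log(toCodeUnits('中文')); // [ '4E2D', '6587' ]
console.log(toCodeUnits('🀄')); // [ 'D83C', 'DC04' ]
```

The two entries for 🀄 are the 55356 and 56324 seen above, written in hex.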

Other operations behave unexpectedly as well:

console.log('🀄'.length); // 2, although it is a single character
'🀄'.split(''); // ["�", "�"], split into two broken halves
/^.$/.test('🀄'); // false

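👉🏻 Key point: the Unicode standard groups code points into planes of 2**16 code points each. There are 17 planes, covering code points 0x000000 through 0x10FFFF, so a code point fits in 3 bytes.

The leading byte is the plane number, from 0x0 to 0x10, which gives 17 planes in total.

Plane 0 is the Basic Multilingual Plane (BMP). Every code point in it fits in a single 16-bit code unit, so UTF-16 encodes these characters with one code unit each.

The remaining planes are the supplementary planes. Their characters are called supplementary characters, and their code points all lie beyond the 16-bit range.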
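Since each plane holds 2**16 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with getPlane as a hypothetical helper name:

```javascript
// Hypothetical helper: the plane number is the code point divided by 2**16.
function getPlane(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  return codePoint >> 16;
}

console.log(getPlane(0x4E2D)); // 0  (中 is in the BMP)
console.log(getPlane(0x1F004)); // 1  (🀄 is in a supplementary plane)
console.log(getPlane(0x10FFFF)); // 16 (the last of the 17 planes)
```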

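The Unicode-related APIs of ES5 and earlier work only on 16-bit UTF-16 code units, so they handle BMP characters correctly and nothing more; every string operation is defined in terms of those code units.

As a result, a supplementary character such as 🀄 produces results that do not match expectations.

Since ES2015, JavaScript has offered new APIs that work with Unicode code points, so we can write: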

const str = '🀄';

console.log(str.codePointAt(0)); // 126980

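👉🏻 Key point: String.prototype.codePointAt(index) returns the Unicode code point of the character that starts at position index. Unlike the older charCodeAt, it handles supplementary characters properly.

Correspondingly, String.fromCodePoint turns a code point back into its character: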

console.log(String.fromCodePoint(126980)); // 🀄
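One detail worth keeping in mind: the index passed to codePointAt still counts UTF-16 code units, not characters. A loop over indices therefore has to skip the second half of each surrogate pair. The sketch below assumes a hypothetical helper named toCodePoints; Array.from with the string iterator gives the same result.

```javascript
// Hypothetical helper: collect the code points of a string.
function toCodePoints(str) {
  const points = [];
  for (let i = 0; i < str.length; i++) {
    const point = str.codePointAt(i);
    points.push(point);
    // A supplementary character occupies two code units, so skip its second half.
    if (point > 0xFFFF) i++;
  }
  return points;
}

console.log('🀄中'.codePointAt(1)); // 56324, the low surrogate, not 中
console.log(toCodePoints('🀄中')); // [ 126980, 20013 ]
console.log(Array.from('🀄中', ch => ch.codePointAt(0))); // [ 126980, 20013 ]
```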

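Unicode escapes

JavaScript strings support Unicode escapes: a character can be written as the prefix \u followed by the four hexadecimal digits of its code point. For example: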

console.log('\u4e2d\u6587'); // 中文

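0x4e2d and 0x6587 are the hexadecimal forms of 20013 and 25991.

Note that Unicode escapes are not limited to string literals. \uxxxx can also be used in identifiers, where the escaped and literal forms are interchangeable, as long as the character is one that is allowed in an identifier. For example: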

const \u4e2d\u6587 = '测试';

console.log(中文); // 测试

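The code above declares a variable named 中文. The declaration spells the name with Unicode escapes and console.log uses the literal characters, and both refer to the same variable.

The \u-plus-four-hex-digits form also covers only BMP characters, so using it directly for a supplementary character does not work: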

console.log('\u1f004'); // ἀ4

Here the engine parses \u1f004 as the character \u1f00 followed by the digit 4. Wrapping the code point in {} fixes it:

console.log('\u{1f004}'); // 🀄
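Going the other way, a string can be turned into its escaped form by walking it code point by code point and choosing between the two notations. This is only a sketch; toUnicodeEscape is a hypothetical name, not a built-in.

```javascript
// Hypothetical helper: render each code point of a string as an escape sequence.
function toUnicodeEscape(str) {
  let out = '';
  for (const ch of str) { // for...of walks the string by code point
    const point = ch.codePointAt(0);
    const hex = point.toString(16);
    out += point > 0xFFFF
      ? '\\u{' + hex + '}'            // supplementary: needs the {} form
      : '\\u' + hex.padStart(4, '0'); // BMP: exactly four hex digits
  }
  return out;
}

console.log(toUnicodeEscape('中文🀄')); // \u4e2d\u6587\u{1f004}
```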

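Surrogate pairs

To represent the supplementary planes alongside the BMP, UTF-16 uses surrogate pairs: two 16-bit code units that together stand for one code point. The rules are:

A character in the BMP is encoded as a single 16-bit code unit, that is, two bytes.
A supplementary character is encoded as two 16-bit code units, as follows:
- First subtract 0x10000 from its code point.
- Write the result as a 20-bit binary number, yyyy yyyy yyxx xxxx xxxx.
- Encode it as 110110yy yyyyyyyy 110111xx xxxxxxxx, four bytes in total.

110110yyyyyyyyyy and 110111xxxxxxxxxx are the two surrogates that make up the pair. The first (high) surrogate ranges from U+D800 to U+DBFF and the second (low) surrogate from U+DC00 to U+DFFF.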
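The three steps above can be traced for 🀄 (U+1F004): subtracting 0x10000 leaves 0xF004, whose top 10 bits are 0x3C and bottom 10 bits are 0x004. A minimal sketch, with toSurrogatePair as a hypothetical helper name:

```javascript
// Hypothetical helper: split a supplementary code point into its surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a supplementary code point: ' + codePoint);
  }
  const offset = codePoint - 0x10000;    // step 1: a 20-bit value
  const high = 0xD800 | (offset >> 10);  // 110110 followed by the top 10 bits
  const low = 0xDC00 | (offset & 0x3FF); // 110111 followed by the bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F004);
console.log(high.toString(16), low.toString(16)); // d83c dc04
console.log(high, low); // 55356 56324
```

These are the same two values that charCodeAt(0) and charCodeAt(1) returned earlier.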

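Implementing getCodePoint

With surrogate pairs understood, we can implement getCodePoint on top of charCodeAt:

function getCodePoint(str, idx = 0) {
  const code = str.charCodeAt(idx);
  // A high surrogate only forms a pair when a low surrogate follows it.
  if(code >= 0xD800 && code <= 0xDBFF) {
    const high = code;
    const low = str.charCodeAt(idx + 1);
    if(low >= 0xDC00 && low <= 0xDFFF) {
      return ((high - 0xD800) * 0x400) +
        (low - 0xDC00) + 0x10000;
    }
  }
  return code;
}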

console.log(getCodePoint('中')); // 20013
console.log(getCodePoint('🀄')); // 126980

In the same way, fromCodePoint can be built on fromCharCode:

function fromCodePoint(...codePoints) {
  let str = '';
  for(let i = 0; i < codePoints.length; i++) {
    let codePoint = codePoints[i];
    if(codePoint <= 0xFFFF) {
      str += String.fromCharCode(codePoint);
    } else {
      codePoint -= 0x10000;
      const high = (codePoint >> 10) + 0xD800;
      const low = (codePoint % 0x400) + 0xDC00;
      str += String.fromCharCode(high) + String.fromCharCode(low);
    }
  }
  return str;
}

console.log(fromCodePoint(126980, 20013)); // 🀄中

This is the approach a polyfill for older browsers can take. MDN's documentation for codePointAt and fromCodePoint has provided polyfills along these lines.

getCodePointCount

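A JavaScript string's length counts UTF-16 code units, not characters, which explains what we saw earlier:

console.log('🀄'.length); // 2

There are several ways to count Unicode characters instead. The spread operator converts a string to an array code point by code point, so: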

function getCodePointCount(str) {
  return [...str].length;
}
console.log(getCodePointCount('👫中'));

Or use a regular expression with the u flag:

function getCodePointCount(str) {
  // [\s\S] rather than . so that line terminators are counted too.
  let result = str.match(/[\s\S]/gu);
  return result ? result.length : 0;
}
console.log(getCodePointCount('👫中'));
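For environments without spread or the u flag, the same count can be obtained by scanning code units and treating each valid surrogate pair as one character. This is a sketch under that assumption; countCodePoints is a hypothetical name.

```javascript
// Hypothetical helper: count code points by scanning UTF-16 code units.
function countCodePoints(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const next = str.charCodeAt(i + 1); // NaN past the end, so the range test fails safely
    const isPair = unit >= 0xD800 && unit <= 0xDBFF &&
      next >= 0xDC00 && next <= 0xDFFF;
    if (isPair) i++; // consume the low surrogate together with the high one
    count++;
  }
  return count;
}

console.log(countCodePoints('🀄中')); // 2
console.log(countCodePoints('🀄中') === [...'🀄中'].length); // true
console.log(countCodePoints('\uD83C')); // 1, a lone surrogate still counts once
```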

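Going further

UTF-16 always uses 4 bytes for a supplementary character, whereas UTF-8 is a variable-length encoding. As originally designed, it used 1 to 6 bytes per character; because Unicode stops at U+10FFFF, the current standard (RFC 3629) allows at most 4.

The original UTF-8 scheme is laid out as follows (the 5- and 6-byte rows are no longer valid):

Bytes	Start	End	byte1	byte2	byte3	byte4	byte5	byte6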
1	U+0000	U+007F	0xxxxxxx
2	U+0080	U+07FF	110xxxxx	10xxxxxx
3	U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
4	U+10000	U+1FFFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
5	U+200000	U+3FFFFFF	111110xx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx
6	U+4000000	U+7FFFFFFF	1111110x	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx
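The first four rows of the table translate directly into bit operations. The sketch below encodes a single code point by hand; encodeUtf8 is a hypothetical helper name, and for brevity it does not reject surrogate code points (U+D800 to U+DFFF), which strict UTF-8 forbids.

```javascript
// Hypothetical helper: encode one code point as UTF-8 bytes (1 to 4 bytes).
function encodeUtf8(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  if (codePoint <= 0x7F) return [codePoint]; // 0xxxxxxx
  if (codePoint <= 0x7FF) {                  // 110xxxxx 10xxxxxx
    return [0xC0 | (codePoint >> 6), 0x80 | (codePoint & 0x3F)];
  }
  if (codePoint <= 0xFFFF) {                 // 1110xxxx 10xxxxxx 10xxxxxx
    return [
      0xE0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3F),
      0x80 | (codePoint & 0x3F),
    ];
  }
  return [                                   // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0xF0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3F),
    0x80 | ((codePoint >> 6) & 0x3F),
    0x80 | (codePoint & 0x3F),
  ];
}

const hex = bytes => bytes.map(b => b.toString(16)).join(' ');
console.log(hex(encodeUtf8(0x4E2D))); // e4 b8 ad
console.log(hex(encodeUtf8(0x1F004))); // f0 9f 80 84
```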

encodeURIComponent in browsers and Buffer in Node both use UTF-8 by default:

console.log(encodeURIComponent('中')); // %E4%B8%AD

const buffer = Buffer.from('中');
console.log(buffer); // <Buffer e4 b8 ad>
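TextEncoder and TextDecoder expose the same UTF-8 conversion as standard globals in browsers and in recent Node versions. A short sketch, assuming an environment that provides both:

```javascript
// TextEncoder always produces UTF-8; TextDecoder uses UTF-8 unless told otherwise.
const bytes = new TextEncoder().encode('中🀄');
console.log(Array.from(bytes, b => b.toString(16)).join(' ')); // e4 b8 ad f0 9f 80 84

const text = new TextDecoder().decode(bytes);
console.log(text); // 中🀄
```

The BMP character takes three bytes and the supplementary character takes four.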

E4, B8 and AD are the hexadecimal values of the three bytes. Let's decode them:

const byte1 = parseInt('E4', 16); // 228
const byte2 = parseInt('B8', 16); // 184
const byte3 = parseInt('AD', 16); // 173

const codePoint = (byte1 & 0xf) << 12 | (byte2 & 0x3f) << 6 | (byte3 & 0x3f);

console.log(codePoint); // 20013

We strip the prefix bits 1110, 10 and 10 from the three bytes and concatenate what remains from the most significant bits to the least, which gives exactly 20013, the code point of '中'.

So the UTF-8 encoding rules give us another, general way to implement getCodePoint:

function getCodePoint(char) {
  const code = char.charCodeAt(0);
  if(code <= 0x7f) return code;
  const bytes = encodeURIComponent(char)
    .slice(1)
    .split('%')
    .map(c => parseInt(c, 16));

  let ret = 0;
  const len = bytes.length;
  for(let i = 0; i < len; i++) {
    // The lead byte carries 7 - len payload bits (5, 4 or 3); continuation bytes carry 6.
    const mask = i === 0 ? 0xff >> (len + 1) : 0x3f;
    ret |= (bytes[i] & mask) << 6 * (len - i - 1);
  }
  return ret;
}

console.log(getCodePoint('中')); // 20013
console.log(getCodePoint('🀄')); // 126980

Likewise, we can implement fromCodePoint:

function fromCodePoint(point) {
  if(point <= 0xffff) return String.fromCharCode(point);
  // UTF-8 ends at U+10FFFF; decodeURIComponent rejects the old 5- and 6-byte forms.
  if(point > 0x10ffff) throw new RangeError('Invalid code point: ' + point);
  const bytes = [];
  // Three continuation bytes (10xxxxxx), filled from the lowest 6 bits up.
  bytes.unshift(point & 0x3f | 0x80);
  point >>>= 6;
  bytes.unshift(point & 0x3f | 0x80);
  point >>>= 6;
  bytes.unshift(point & 0x3f | 0x80);
  point >>>= 6;
  // Lead byte 11110xxx carries the remaining 3 bits.
  bytes.unshift(point & 0x7 | 0xf0);
  const code = '%' + bytes.map(b => b.toString(16)).join('%');
  return decodeURIComponent(code);
}

console.log(fromCodePoint(126980)); // 🀄

If there is anything else about Unicode you would like to discuss, feel free to leave a comment on this issue.

迷渡 · Answer 1 · Thu Jul 25 2019 12:36:15 GMT+0800 (China Standard Time)

Code:

[..."👨‍👩‍👧‍👦"] 
// ["👨", "‍", "👩", "‍", "👦", "‍", "👦"]

PS: most input methods can produce this emoji when you type “一家人” ("family").

Yvo · Answer 2 · Thu Jul 25 2019 12:59:40 GMT+0800 (China Standard Time)

Right, that is an emoji combining sequence, which follows a different set of rules.

Jarvis · Answer 3 · Thu Jul 25 2019 13:45:06 GMT+0800 (China Standard Time)

Code:
[..."👨‍👩‍👧‍👦"] === ["👨", "‍", "👩", "‍", "👦", "‍", "👦"]
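
PS: most input methods can produce this emoji when you type “一家人” ("family").

Strict equality shouldn't hold here, should it? The two arrays are not the same object.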

迷渡 · Answer 4 · Thu Jul 25 2019 13:47:45 GMT+0800 (China Standard Time)

@Jiasm Right, two separate arrays are never equal. Let me fix it.

console.log([..."👨‍👩‍👧‍👦"])
// ["👨", "‍", "👩", "‍", "👦", "‍", "👦"]

HE Shi-Jun · Answer 5 · Fri Jul 26 2019 18:01:06 GMT+0800 (China Standard Time)

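Unicode escapes are not limited to string literals; the whole of JavaScript source code supports them, and the forms are interchangeable

This part is a bit tricky, because characters outside ASCII can really only appear in identifiers. And readers who try it with 🀄 will fail (var 🀄 = 1 is not valid), since not every character is allowed in an identifier; emoji, for example, are not. Covering this properly would take a lot of text. I suggest dropping this part and writing a separate article about it later.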

Yvo · Answer 6 · Sat Jul 27 2019 09:27:32 GMT+0800 (China Standard Time)

That's true... I'll revise it. This part deserves a separate write-up.