qntm / base2048

Binary encoding optimised for Twitter

Possible optimization: emoji (and possibly all extended grapheme clusters?) are treated as two characters

leo60228 opened this issue

See the replies at https://twitter.com/leo60228/status/1255605631114936321. The first is made up of 140 emoji, 7 codepoints each. The second is made up of 140 Tamil letters, 2 codepoints each, though both of those codepoints are light ones.
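
For reference, a minimal sketch of the counting involved, assuming Twitter's weighted length rules as implemented in twitter-text v3: a 280-unit budget per tweet, weight 2 per emoji ZWJ sequence regardless of how many code points it contains, and weight 1 per "light" code point.

```python
# Sketch of why both replies max out at 140 units, under the
# assumed twitter-text v3 weighting rules described above.

BUDGET = 280                # weight units per tweet

emoji_weight = 2            # per ZWJ sequence, even at 7 code points
tamil_letter_weight = 2     # 2 light code points at 1 unit each

print(BUDGET // emoji_weight)         # 140 emoji per tweet
print(BUDGET // tamil_letter_weight)  # 140 Tamil letters per tweet
```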

commented

These approaches likely can't be used to modify the Base2048 encoding itself, since that's explicitly an 11-bit encoding which is kind of set in stone at the moment... however, there could be some merit here in developing potential successors with better data-per-Tweet efficiency.

In order to beat Base2048 in terms of efficiency, each of these 140 emojis or Tamil letters needs to express more than 22 bits of information. For the Tamil example, that means we need a repertoire of more than 2^11 = 2,048 Tamil code points which can be freely joined into pairs at will. I doubt there are that many Tamil code points.
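
Spelling that arithmetic out as a sketch (assuming the 280-unit budget and Base2048's 11 bits per light character):

```python
BUDGET = 280

# Base2048 baseline: 11 bits per 1-unit character.
base2048_bits = BUDGET * 11               # 3080 bits = 385 bytes

# 140 two-codepoint Tamil letters fit per tweet, so each letter
# must carry more than this to win:
bits_per_letter = base2048_bits / 140     # 22.0 bits

# Split across 2 code points, each position needs a repertoire of
# more than 2^11 = 2048 freely joinable Tamil code points:
repertoire_needed = 2 ** (bits_per_letter / 2)

print(base2048_bits // 8, bits_per_letter, repertoire_needed)
# 385 22.0 2048.0
```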

For the emoji example, three of the seven code points are zero-width joiners, leaving only four to play with. Here, we need a repertoire of more than 2^5.5 code points which can be joined in this manner. Say, 2^6 = 64 of them? Combined, these would express 24 bits per fully joined emoji, or 420 bytes per Tweet.
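
The same sketch for the emoji case (four free code point slots, 140 sequences per tweet):

```python
free_slots = 4                       # 7 code points minus 3 ZWJs

# To beat 22 bits per emoji, each slot needs a repertoire larger than:
threshold = 2 ** (22 / free_slots)   # 2^5.5, about 45.3 code points

# With 2^6 = 64 joinable code points per slot:
bits_per_emoji = free_slots * 6      # 24 bits
bytes_per_tweet = 140 * bits_per_emoji // 8

print(round(threshold, 2), bits_per_emoji, bytes_per_tweet)
# 45.25 24 420
```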

That said, scanning this page on the topic suggests that we don't have anywhere near this amount of flexibility. Only around 1100 possible combinations are expressed, total.

A recent proposal for mixed-race family emojis would add 7,230 emojis, each of which would presumably be a single heavy Twitter character. I'm not sure this would be significantly more efficient, and deprecating non-default-color families seems more likely.
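
For scale, a quick check of that figure against the 22-bit threshold above (a sketch; assumes each family emoji would still weigh 2 units):

```python
import math

proposed = 7230
bits_per_family_emoji = math.log2(proposed)   # ~12.82 bits

# Each such emoji would cost 2 weight units, so this is ~6.4 bits
# per unit, versus Base2048's 11 bits per unit.
print(round(bits_per_family_emoji, 2))        # 12.82
```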