qntm / base2048

Binary encoding optimised for Twitter

Possible optimization: emoji (and possibly all extended grapheme clusters?) are treated as two characters

leo60228 opened this issue

See the replies at https://twitter.com/leo60228/status/1255605631114936321. The first is made up of 140 emoji, 7 codepoints each. The second is made up of 140 Tamil letters, 2 codepoints each, though both of those codepoints are light ones.
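
For reference, a minimal sketch of the counting involved, assuming Twitter's weighted length rules as implemented in twitter-text v3: a 280-unit budget per tweet, weight 2 per emoji ZWJ sequence regardless of how many code points it contains, and weight 1 per "light" code point.

```python
# Sketch of why both replies max out at 140 units, under the
# assumed twitter-text v3 weighting rules described above.

BUDGET = 280                # weight units per tweet

emoji_weight = 2            # per ZWJ sequence, even at 7 code points
tamil_letter_weight = 2     # 2 light code points at 1 unit each

print(BUDGET // emoji_weight)         # 140 emoji per tweet
print(BUDGET // tamil_letter_weight)  # 140 Tamil letters per tweet
```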

commented

These approaches likely can't be used to modify the Base2048 encoding itself, since that's explicitly an 11-bit encoding which is kind of set in stone at the moment... however, there could be some merit here in developing potential successors with better data-per-Tweet efficiency.

In order to beat Base2048 in terms of efficiency, each of these 140 emojis or Tamil letters needs to express more than 22 bits of information. For the Tamil example, that means we need a repertoire of more than 2^11 = 2,048 Tamil code points which can be freely joined into pairs at will. I doubt there are that many Tamil code points.
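
Spelling that arithmetic out as a sketch (assuming the 280-unit budget and Base2048's 11 bits per light character):

```python
BUDGET = 280

# Base2048 baseline: 11 bits per 1-unit character.
base2048_bits = BUDGET * 11               # 3080 bits = 385 bytes

# 140 two-codepoint Tamil letters fit per tweet, so each letter
# must carry more than this to win:
bits_per_letter = base2048_bits / 140     # 22.0 bits

# Split across 2 code points, each position needs a repertoire of
# more than 2^11 = 2048 freely joinable Tamil code points:
repertoire_needed = 2 ** (bits_per_letter / 2)

print(base2048_bits // 8, bits_per_letter, repertoire_needed)
# 385 22.0 2048.0
```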

For the emoji example, three of the seven code points are zero-width joiners, leaving only four to play with. Here, we need a repertoire of more than 2^5.5 code points which can be joined in this manner. Say, 2^6 = 64 of them? Combined, these would express 24 bits per fully joined emoji, or 420 bytes per Tweet.
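
The same sketch for the emoji case (four free code point slots, 140 sequences per tweet):

```python
free_slots = 4                       # 7 code points minus 3 ZWJs

# To beat 22 bits per emoji, each slot needs a repertoire larger than:
threshold = 2 ** (22 / free_slots)   # 2^5.5, about 45.3 code points

# With 2^6 = 64 joinable code points per slot:
bits_per_emoji = free_slots * 6      # 24 bits
bytes_per_tweet = 140 * bits_per_emoji // 8

print(round(threshold, 2), bits_per_emoji, bytes_per_tweet)
# 45.25 24 420
```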

That said, scanning this page on the topic suggests that we don't have anywhere near this amount of flexibility. Only around 1100 possible combinations are expressed, total.

A recent proposal for mixed-race family emojis would add 7,230 emojis, each of which would presumably be a single heavy Twitter character. I'm not sure this would be significantly more efficient, and deprecating non-default-color families seems more likely.
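
For scale, a quick check of that figure against the 22-bit threshold above (a sketch; assumes each family emoji would still weigh 2 units):

```python
import math

proposed = 7230
bits_per_family_emoji = math.log2(proposed)   # ~12.82 bits

# Each such emoji would cost 2 weight units, so this is ~6.4 bits
# per unit, versus Base2048's 11 bits per unit.
print(round(bits_per_family_emoji, 2))        # 12.82
```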