Properly measure unicode beyond ascii

Question

JakeWharton opened this issue 5 years ago

Jake Wharton commented 5 years ago

SimpleTextLayout naively assumes that the char count is the display width.

Jake Wharton commented 4 years ago

2ef687d

Robert W · Answer 1 · Tue Dec 31 2019 23:04:08 GMT+0800 (China Standard Time)

I did some tests here and it mostly works quite well, but making it any better will be really hard. The problem is that not all characters are the same width, even in monospace fonts. This is especially common with CJK fonts, and it means you can never assume that single characters will align across all of Unicode. Note that the Unicode spec does include information on character width which could help, but it's not part of the standard Java or Kotlin APIs.
Examples of text width: depending on your font, some, all, or none of the rows in each block should line up, since not all monospace fonts are made equal (hoping GitHub won't mess this up):

latin     | mmmmm |
Half-kana | ﾈﾈﾈﾈﾈ |

Full-latin| ｍｍｍｍｍ |
Full-kana | ネネネネネ |
Emoji     | 😃😃😃😃😃 |
CJK       | 北北北北北 |

I've made a PR for a fix that will measure all characters consistently, but this may make alignment worse for emoji, which are generally closer to full width (although often not exactly, and it's font-dependent). We shouldn't rely on this anyway, given there are many full-width characters in the BMP (Basic Multilingual Plane).
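To make the difference concrete, here is a minimal Java sketch comparing the UTF-16 char count with the code point count for some of the sample characters above. The class and method names are made up for illustration and this is not the code from the PR; it only shows why counting chars over-measures emoji, while counting code points treats every character as exactly one unit.

```java
class CodePointWidthDemo {
    // Naive width: number of UTF-16 chars, which is what String.length() reports.
    static int charWidth(String s) {
        return s.length();
    }

    // Consistent width: every code point counts as exactly one unit.
    static int codePointWidth(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String latin = "mmmmm";
        String cjk = "\u5317".repeat(5);          // U+5317, inside the BMP
        String emoji = "\uD83D\uDE03".repeat(5);  // U+1F603, outside the BMP (surrogate pair)

        for (String s : new String[] {latin, cjk, emoji}) {
            System.out.println("chars=" + charWidth(s) + " codePoints=" + codePointWidth(s));
        }
    }
}
```

Only the emoji row differs between the two counts (10 chars versus 5 code points), because each emoji is stored as a surrogate pair. The CJK row counts as 5 either way, even though it typically renders wider than the Latin row.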

The better (but potentially horrible to implement) fix would be to use the native font and rendering APIs to measure the text. These are all platform specific (Windows, Android, iOS...) and would require consumers to specify the output font.
But even if you have a pixel size for the text, it's not clear how it should be aligned using only other characters. Unicode defines a whole bunch of space characters with different widths, but support and size will again depend on the font. It's probably best to accept this as a known limitation for now.

Bonus: Just to make measuring even more impossible, emoji can be modified, meaning up to 7 Unicode characters can turn into a single glyph (depending on font and OS support).
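As an illustration of such a sequence, the sketch below builds the "family: man, woman, girl, boy" emoji, which is four emoji joined by three zero-width joiners (U+200D). The class name is hypothetical, and whether the result renders as one glyph still depends on the font and OS.

```java
class EmojiSequenceDemo {
    // Four person emoji joined by three ZERO WIDTH JOINER code points (U+200D).
    static String family() {
        int[] codePoints = {0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467, 0x200D, 0x1F466};
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = family();
        // 4 surrogate pairs (8 chars) + 3 joiners (3 chars) = 11 chars.
        System.out.println("chars=" + s.length());
        // 7 code points, yet at most one glyph on screen.
        System.out.println("codePoints=" + s.codePointCount(0, s.length()));
    }
}
```

Neither count matches what a terminal with full emoji support would draw, so even code point counting measures this string as 7 units.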

Robert W · Answer 2 · Tue Dec 31 2019 23:21:04 GMT+0800 (China Standard Time)

Update: character width data is available in icu4j: https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/lang/UProperty.html#EAST_ASIAN_WIDTH. This won't fix any of the above issues with fonts, but it will be a good indication of whether to measure a char as one or two "units".
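The one-or-two "units" idea could look roughly like the sketch below. It assumes icu4j is not on the classpath, so it uses a few hand-written ranges and `Character.UnicodeScript` from the JDK as a coarse stand-in for the real East Asian Width data. The class, the method names, and the choice of scripts are all assumptions for illustration. Emoji and many other wide characters are deliberately not covered, and with icu4j the `EAST_ASIAN_WIDTH` property would replace the hand-written checks.

```java
class ApproxWidthDemo {
    // Coarse approximation: 2 units for fullwidth/wide characters, otherwise 1.
    static int units(int codePoint) {
        // Fullwidth forms: U+FF01..U+FF60 and U+FFE0..U+FFE6.
        if ((codePoint >= 0xFF01 && codePoint <= 0xFF60)
                || (codePoint >= 0xFFE0 && codePoint <= 0xFFE6)) {
            return 2;
        }
        // Halfwidth forms (including halfwidth katakana): U+FF61..U+FFDC.
        if (codePoint >= 0xFF61 && codePoint <= 0xFFDC) {
            return 1;
        }
        // Assumption: treat the main CJK scripts as wide.
        switch (Character.UnicodeScript.of(codePoint)) {
            case HAN:
            case HIRAGANA:
            case KATAKANA:
            case HANGUL:
                return 2;
            default:
                return 1;
        }
    }

    // Sum of per-code-point units for the whole string.
    static int width(String s) {
        return s.codePoints().map(ApproxWidthDemo::units).sum();
    }

    public static void main(String[] args) {
        System.out.println(width("mmmmm"));             // latin
        System.out.println(width("\uFF88".repeat(5)));  // halfwidth katakana
        System.out.println(width("\uFF4D".repeat(5)));  // fullwidth latin
        System.out.println(width("\u5317".repeat(5)));  // CJK
    }
}
```

Under these assumptions the Latin and halfwidth rows measure 5 units and the fullwidth and CJK rows measure 10, which matches the two blocks in the earlier comment. It says nothing about how a particular font actually draws them.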

Yuri Schimke · Answer 3 · Thu Aug 13 2020 05:23:02 GMT+0800 (China Standard Time)

Came here to report this in a Kotlin script.

Jake Wharton · Answer 4 · Thu Aug 13 2020 07:04:05 GMT+0800 (China Standard Time)

Unfortunately even with proper measurement, emoji rarely conform to monospace properly so the border characters and subsequent columns will always be misaligned.

Yuri Schimke · Answer 5 · Thu Aug 13 2020 15:28:07 GMT+0800 (China Standard Time)

Yep, I went in there to fix it and can see you are already handling it correctly; it's just the additional width of those Unicode characters.

Jake Wharton · Answer 6 · Wed Sep 16 2020 11:19:49 GMT+0800 (China Standard Time)

The library now handles ANSI escape sequences (which measure to zero) and multi-char code points (which measure to one). Going to close for want of specific issues which are not dealing with monospace fonts and their lack of support for emoji and non-Western scripts.
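For readers who want to see what that measurement rule means in practice, here is a minimal Java sketch. It is not the library's implementation: the class name, the `measure` helper, and the regex are assumptions, and the pattern only covers CSI-style sequences such as color codes rather than every ANSI escape.

```java
import java.util.regex.Pattern;

class AnsiAwareMeasureDemo {
    // Simplified pattern for CSI sequences such as ESC[31m or ESC[0m.
    private static final Pattern ANSI_CSI = Pattern.compile("\u001B\\[[0-9;]*[A-Za-z]");

    // Escape sequences measure to zero; every remaining code point measures to one.
    static int measure(String s) {
        String visible = ANSI_CSI.matcher(s).replaceAll("");
        return visible.codePointCount(0, visible.length());
    }

    public static void main(String[] args) {
        System.out.println(measure("\u001B[31mred\u001B[0m")); // the colored word "red"
        System.out.println(measure("\uD83D\uDE03"));           // one emoji stored as two chars
    }
}
```

Both the colored word and the surrogate-pair emoji come out at their code point counts (3 and 1). Per the comment above, that is as far as measurement can go without knowing the font.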