Unicode characters

Question

mame opened this issue 3 years ago

Currently, error_highlight does not handle Unicode characters well. There are two subissues.

1. Ruby::AST::Node#first_column and #last_column seem to return the column in bytes, but String#match handles the index in characters. We need to convert the column indexes.
2. Some Unicode characters are displayed as two (or more?) columns in a terminal with monospace font.

(1) is relatively simple, but (2) is a bit tough. It requires a table telling how many columns each character has. It is known that Reline has such a table. But because error_highlight is a built-in gem that is loaded at Ruby process invocation, it is not good for error_highlight to depend on Reline (unless we make Reline a special built-in gem). We need to discuss how we make the table available to error_highlight.
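To make subissue (1) concrete, here is a minimal sketch of the byte/character mismatch and one possible conversion. The helper name byte_column_to_char_column is hypothetical and is not part of error_highlight; the sketch assumes UTF-8 source and byte offsets that fall on character boundaries.

```ruby
# U+3042 (HIRAGANA LETTER A) occupies 3 bytes in UTF-8 but is 1 character.
line = "\u3042 = nil.foo"

# String#index counts characters; the same search on a binary copy counts bytes.
char_column = line.index("nil")      # character offset of "nil"
byte_column = line.b.index("nil".b)  # byte offset of "nil"

# Hypothetical helper: convert a byte offset within a line to a character
# offset by slicing off the leading bytes and counting their characters.
def byte_column_to_char_column(line, byte_column)
  line.byteslice(0, byte_column).length
end

puts "char=#{char_column} byte=#{byte_column} " \
     "converted=#{byte_column_to_char_column(line, byte_column)}"
```

Here the byte offset is 6 and the character offset is 4, so an index taken from a byte-based column cannot be passed to character-based string methods without conversion.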

Kevin Newton · Answer 1 · Thu Sep 16 2021 01:00:36 GMT+0800 (China Standard Time)

Hey @mame!

I hit this same thing with ripper when I was writing prettier. I ended up solving it by taking the source, splitting it up into multiple lines, and converting each into an object that responded to #[] so that I could get the right indices.
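The idea described above can be sketched roughly as follows. This is an illustration only, not the linked code: the class name ByteToCharIndex is hypothetical, and it assumes each line is a valid UTF-8 string.

```ruby
# Hypothetical per-line lookup table: table[byte_index] returns the index of
# the character that contains that byte.
class ByteToCharIndex
  def initialize(line)
    @indices = []
    line.each_char.with_index do |char, char_index|
      # Every byte of a multibyte character maps to the same character index.
      char.bytesize.times { @indices << char_index }
    end
    # Also allow looking up the offset one past the last byte.
    @indices << line.length
  end

  def [](byte_index)
    @indices[byte_index]
  end
end

source = "a = 1\n\u3042 = nil.foo\n"
tables = source.lines.map { |line| ByteToCharIndex.new(line) }

# On the second line, byte offset 6 is the start of "nil", character offset 4.
puts tables[1][6]
```

Building the table once per line trades a little memory for constant-time lookups, which helps when many nodes on the same line need converting.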

Here are some links to the source:

I hope it's helpful!

Yusuke Endoh · Answer 2 · Thu Sep 16 2021 02:38:18 GMT+0800 (China Standard Time)

Thanks for the information. I think it is about issue (1) that I described. Yeah, it is solvable by converting the indices.

The tougher issue is (2). Unfortunately, some Unicode characters (mainly Chinese, Japanese, and Korean characters) are rendered as if they have two columns.

あ is one Japanese letter that takes two columns in the terminal. To highlight the letter, we need to put two ^s under the line. To implement this, error_highlight needs a table that tells which characters take two (or more) columns.
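A rough sketch of what such a table-driven underline could look like is below. The WIDE_RANGES table is a deliberately tiny, illustrative subset of wide code points, not a complete East Asian Width table, and the method names display_width and underline are hypothetical. It also ignores ambiguous-width and combining characters.

```ruby
# Illustrative subset only; a real table would be generated from Unicode data.
WIDE_RANGES = [
  0x3041..0x3096, # Hiragana letters
  0x30A1..0x30FA, # Katakana letters
  0x4E00..0x9FFF, # CJK Unified Ideographs
  0xAC00..0xD7A3, # Hangul syllables
  0xFF01..0xFF60  # Fullwidth ASCII variants
].freeze

# Number of terminal columns a string is assumed to occupy:
# 2 for characters in the table, 1 for everything else.
def display_width(str)
  str.each_char.sum do |ch|
    WIDE_RANGES.any? { |range| range.cover?(ch.ord) } ? 2 : 1
  end
end

# Build the marker line for the characters char_start...(char_start + char_length).
def underline(line, char_start, char_length)
  prefix = line[0, char_start]
  target = line[char_start, char_length]
  " " * display_width(prefix) + "^" * display_width(target)
end

line = "\u3042.foo" # U+3042 (HIRAGANA LETTER A) followed by ".foo"
puts line
puts underline(line, 0, 1) # two carets under the wide letter
puts underline(line, 1, 4) # two spaces of padding, then four carets
```

Both the padding and the marker are measured in display columns, so wide characters before the highlighted span shift the carets correctly as well.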

Just FYI: To make matters worse, the column count may change depending on the font and the terminal. This issue is called East Asian Width:

Ambiguous width characters are all those characters that can occur as fullwidth characters in any of a number of East Asian legacy character encodings. They have a “resolved” width of either narrow or wide depending on the context of their use.

To be honest, I don't want to face this problem for now 😇

Kevin Newton · Answer 3 · Thu Sep 16 2021 03:03:47 GMT+0800 (China Standard Time)

@mame I see, I think I understand the problem better now. In that case it would probably be nice for Ruby::AST::Node to have methods like {first,last}_character_column or something similar.
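As a sketch of what such methods might compute, the following uses a hypothetical Struct as a stand-in for a node, so it does not depend on any real AST API. It assumes the node reports 1-based line numbers and byte-based columns, and that the source lines are available alongside it; the module and method names are illustrative only.

```ruby
# Hypothetical stand-in for an AST node with byte-based column information.
ByteNode = Struct.new(:first_lineno, :first_column, :last_lineno, :last_column)

# Hypothetical helpers that derive character columns from byte columns.
module CharacterColumns
  module_function

  def first_character_column(node, source_lines)
    to_char_column(source_lines[node.first_lineno - 1], node.first_column)
  end

  def last_character_column(node, source_lines)
    to_char_column(source_lines[node.last_lineno - 1], node.last_column)
  end

  # Count the characters contained in the first byte_column bytes of the line.
  def to_char_column(line, byte_column)
    line.byteslice(0, byte_column).length
  end
end

# Two 3-byte letters, then " = " and "nil.foo": the expression "nil.foo"
# spans bytes 9...16 but characters 5...12.
lines = "\u3042\u3044 = nil.foo\n".lines
node = ByteNode.new(1, 9, 1, 16)

puts CharacterColumns.first_character_column(node, lines)
puts CharacterColumns.last_character_column(node, lines)
```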