Unicode characters
mame opened this issue
Currently, error_highlight does not handle Unicode characters well. There are two subissues.
1. RubyVM::AbstractSyntaxTree::Node#first_column and #last_column seem to return the column in bytes, but String#match handles indices in characters. We need to convert the column indexes.
2. Some Unicode characters are displayed as two (or more?) columns in a terminal with a monospace font.
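The byte-versus-character mismatch in the first subissue can be illustrated with a minimal sketch. The helper name `byte_column_to_char_column` is hypothetical and not part of error_highlight. It also assumes the byte column always falls on a character boundary, which should hold for node boundaries in valid UTF-8 source.

```ruby
# Hypothetical helper (not an error_highlight API): convert a byte-based
# column into a character-based column for a single source line.
def byte_column_to_char_column(line, byte_column)
  # String#byteslice(start, length) counts in bytes; String#length on the
  # resulting prefix counts in characters.
  line.byteslice(0, byte_column).length
end

line = "あ.foo"   # "あ" occupies 3 bytes in UTF-8 but is 1 character

byte_column_to_char_column(line, 3)  # => 1 (the "." is the second character)
byte_column_to_char_column(line, 4)  # => 2 (the "f" is the third character)
```

For an ASCII-only line the two columns coincide, so the conversion is a no-op there.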
(1) is relatively simple, but (2) is a bit tough. It requires a table telling how many columns each character has. It is known that Reline has such a table. But because error_highlight is a built-in gem that is loaded at Ruby process invocation, it is not good for error_highlight to depend on Reline (unless we make Reline a special built-in gem). We need to discuss how we make the table available to error_highlight.
Hey @mame!
I hit this same thing with ripper when I was writing prettier. I ended up solving it by taking the source, splitting it up into multiple lines, and converting each line into an object that responds to #[] so that I could get the right indices.
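A rough sketch of that per-line lookup idea, under my own assumptions: the names `ByteToCharLine` and `build_line_lookups` are hypothetical and this is not the code from the linked files. Each line becomes an object whose #[] maps a byte column to a character index in the whole source.

```ruby
# Hypothetical sketch: one lookup object per source line, mapping a byte
# column within the line to a character index within the whole source.
class ByteToCharLine
  # line:  the text of one source line
  # start: character index of the line's first character in the source
  def initialize(line, start)
    @indices = []
    line.each_char.with_index(start) do |char, char_index|
      # Every byte of a multibyte character maps to the same character index.
      char.bytesize.times { @indices << char_index }
    end
    # Extra entry so a column pointing just past the line also resolves.
    @indices << start + line.length
  end

  def [](byte_column)
    @indices[byte_column]
  end
end

def build_line_lookups(source)
  start = 0
  source.lines.map do |line|
    lookup = ByteToCharLine.new(line, start)
    start += line.length
    lookup
  end
end

source = "a = 1\nあ = 2\n"
lookups = build_line_lookups(source)

lookups[0][2]  # => 2 (the "=" on the first line)
lookups[1][4]  # => 8 (the "=" on the second line, byte column 4)
```

Building the table costs one pass per line, after which every column lookup is a plain array access.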
Here are some links to the source:
- https://github.com/prettier/plugin-ruby/blob/f6a4a6d4299a91692a8b1aa287103d1e7e887b18/src/ruby/parser.rb#L23-L54
- https://github.com/prettier/plugin-ruby/blob/f6a4a6d4299a91692a8b1aa287103d1e7e887b18/src/ruby/parser.rb#L104-L120
- https://github.com/prettier/plugin-ruby/blob/f6a4a6d4299a91692a8b1aa287103d1e7e887b18/src/ruby/parser.rb#L132-L139
I hope it's helpful!
Thanks for the information. I think this addresses issue (1) that I described. Yeah, it is solvable by converting the indices.
The tougher issue is (2). Unfortunately, some Unicode characters (mainly Chinese, Japanese, and Korean characters) are rendered as if they have two columns. For example, あ is one Japanese letter that takes two columns in the terminal. To highlight the letter, we need to put two ^s under the line. To implement this, error_highlight needs a table that tells which characters take two (or more) columns.
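To make the two-column problem concrete, here is a minimal sketch of a width-aware underline. The `WIDE_RANGES` table is a deliberately incomplete, hand-picked assumption for illustration, not the table Reline uses. `display_width` and `underline` are hypothetical names.

```ruby
# Illustrative and INCOMPLETE set of code point ranges that terminals
# usually render two columns wide. A real table would be generated from
# the Unicode East Asian Width data.
WIDE_RANGES = [
  0x3040..0x30FF, # Hiragana and Katakana
  0x4E00..0x9FFF, # CJK Unified Ideographs
  0xAC00..0xD7A3, # Hangul Syllables
  0xFF01..0xFF60  # Fullwidth forms
].freeze

# Number of terminal columns the string is assumed to occupy.
def display_width(str)
  str.each_char.sum do |char|
    WIDE_RANGES.any? { |range| range.cover?(char.ord) } ? 2 : 1
  end
end

# Build the marker line for the characters in char_start...char_end.
def underline(line, char_start, char_end)
  padding = " " * display_width(line[0...char_start])
  markers = "^" * display_width(line[char_start...char_end])
  padding + markers
end

line = "あ.foo"
underline(line, 0, 1)  # => "^^"      (two markers under the wide letter)
underline(line, 2, 5)  # => "   ^^^"  (three columns of padding for "あ.")
```

The character-based columns only pick the substring here. The padding and the markers are both sized by display width, so a wide character before the highlighted span shifts the markers correctly.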
Just FYI: To make matters worse, the column count may change depending on the font and the terminal. This issue is known as East Asian Width:
Ambiguous width characters are all those characters that can occur as fullwidth characters in any of a number of East Asian legacy character encodings. They have a “resolved” width of either narrow or wide depending on the context of their use.
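A small sketch of why ambiguous width is awkward for a highlighter: the same line yields a different column count depending on how the context resolves those characters. `AMBIGUOUS_SAMPLE` lists only three characters that are classified as Ambiguous, and `sample_width` with its `ambiguous_width` parameter is a hypothetical name chosen for illustration.

```ruby
# Tiny sample of characters classified as Ambiguous in East Asian Width.
# A real implementation would need the full Unicode data.
AMBIGUOUS_SAMPLE = ["§", "±", "α"].freeze

# ambiguous_width is the width the caller assumes the terminal uses for
# ambiguous characters: 1 (narrow) or 2 (wide).
def sample_width(str, ambiguous_width: 1)
  str.each_char.sum do |char|
    AMBIGUOUS_SAMPLE.include?(char) ? ambiguous_width : 1
  end
end

sample_width("α = 1")                      # => 5
sample_width("α = 1", ambiguous_width: 2)  # => 6
```

Nothing in the source line itself says which of the two results matches what the user sees, so the setting would have to come from the environment or a configuration option.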
To be honest, I don't want to face this problem for now 😇
@mame I see, I think I understand the problem better now. In that case it would probably be nice to have RubyVM::AbstractSyntaxTree::Node have methods like {first,last}_character_column or something similar.