Lexer error cuts input within a codepoint
pandaman64 opened this issue
When the lexer sees a non-ASCII illegal token, it emits only the first byte as the cause of the error, resulting in a mangled message.
How to reproduce
Compile the following (non-conforming) source with satysfi (SATySFi version 0.0.6).
あ
Then we get an error with an invalid codepoint. (In the following output, the invalid codepoint is replaced with U+FFFD � replacement character, but the actual output is the first byte of あ.)
$ satysfi -o /dev/null a.saty
---- ---- ---- ----
target file: 'null'
dump file: 'a.satysfi-aux' (will be created)
parsing 'a.saty' ...
! [Syntax Error at Lexer] at "a.saty", line 1, characters 0-1:
illegal token '�' in a program area
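The mangled character can be reproduced outside SATySFi. The following is a minimal Python sketch, not SATySFi's actual OCaml lexer code: it assumes the source is UTF-8 and that the error message is built from a single byte at the error position. The helper names `utf8_sequence_length` and `illegal_token_at` are hypothetical and exist only for illustration. It shows that the first byte of あ alone is not valid UTF-8, and that reading the sequence length from the lead byte recovers the whole character.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with lead_byte occupies."""
    if lead_byte < 0x80:
        return 1  # ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    # Continuation byte or invalid lead byte: report just this one byte.
    return 1


def illegal_token_at(source: bytes, pos: int) -> str:
    """Return the whole character at byte offset pos (hypothetical helper)."""
    length = utf8_sequence_length(source[pos])
    chunk = source[pos:pos + length]
    # Truncated or malformed input still decodes, using U+FFFD as a placeholder.
    return chunk.decode("utf-8", errors="replace")


source = "あ\n".encode("utf-8")  # あ is the three bytes E3 81 82

# Slicing a single byte, as the reported message appears to do, cuts the character.
first_byte_only = source[0:1].decode("utf-8", errors="replace")

# Taking the full sequence keeps the character intact.
whole_char = illegal_token_at(source, 0)

print(repr(first_byte_only), repr(whole_char))
```

Under the same assumption, the reported range "characters 0-1" would be a byte range that covers only the first of the three bytes of あ.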
Related Issues
#312 reports an issue with the error position, but the root cause is the same: error handling is not Unicode-aware.
I have reimplemented the SATySFi lexer with sedlex.
Awesome! That will be a great starting point to resolve this issue.