Lexer error cuts input within a codepoint

Question

pandaman64 opened this issue 2 years ago

When the lexer encounters an illegal non-ASCII token, it reports only the first byte of the offending character as the cause of the error, so the error message is mangled.

How to reproduce

Compile the following (non-conforming) source, saved as a.saty, with satysfi (SATySFi version 0.0.6).

あ

The resulting error message contains an invalid byte sequence. (In the output below, the invalid sequence is shown as U+FFFD �, the replacement character; the actual output contains only the first byte of あ, which is 0xE3 in UTF-8 and is not valid UTF-8 on its own.)

$ satysfi -o /dev/null a.saty
 ---- ---- ---- ----
  target file: 'null'
  dump file: 'a.satysfi-aux' (will be created)
  parsing 'a.saty' ...
! [Syntax Error at Lexer] at "a.saty", line 1, characters 0-1:
    illegal token '�' in a program area
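
The mangling can be reproduced outside SATySFi. The following is a minimal Python sketch, not SATySFi's actual lexer code, and it assumes the source file is UTF-8 encoded. The helper names (first_byte, utf8_length, first_code_point) are hypothetical and exist only for this illustration. It contrasts cutting the input after one byte with cutting it after one whole code point.

```python
def first_byte(source: bytes) -> bytes:
    """Mimic the reported behaviour: take one byte, ignoring code point boundaries."""
    return source[:1]


def utf8_length(lead: int) -> int:
    """Return the byte length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    raise ValueError(f"not a UTF-8 lead byte: {lead:#x}")


def first_code_point(source: bytes) -> bytes:
    """Take all bytes of the first code point, so the slice stays valid UTF-8."""
    return source[:utf8_length(source[0])]


source = "あ".encode("utf-8")  # three bytes: E3 81 82

# Cutting after one byte leaves an incomplete sequence.
print(first_byte(source))                                    # b'\xe3'
print(first_byte(source).decode("utf-8", errors="replace"))  # U+FFFD

# Cutting on the code point boundary keeps the character intact.
print(first_code_point(source).decode("utf-8"))              # あ
```

A Unicode-aware lexer would report the whole code point, as first_code_point does here, rather than a single byte.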

Related Issues

#312 reports an issue with the error position, but the root cause is the same: Unicode-aware treatment of errors.

OOHASHI Daichi · Answer 1 · Sun Dec 19 2021 20:42:33 GMT+0800 (China Standard Time)

I think this and #312 could be fixed by reimplementing the lexer with sedlex. The required modifications would conflict heavily with #294, though.

Naoki Kaneko · Answer 2 · Sun Dec 19 2021 20:56:39 GMT+0800 (China Standard Time)

I have reimplemented SATySFi lexer with sedlex¹.

¹ https://github.com/puripuri2100/satysfifmt/blob/master/src/frontend/lexer.ml

OOHASHI Daichi · Answer 3 · Sun Dec 19 2021 21:05:39 GMT+0800 (China Standard Time)

> I have reimplemented SATySFi lexer with sedlex.

Awesome! That will be a great starting point to resolve this issue.
