gfngfn / SATySFi

A statically-typed, functional typesetting system

Lexer error cuts input within a codepoint

pandaman64 opened this issue

When the lexer sees a non-ASCII illegal token, it emits only the first byte as the cause of the error, resulting in a mangled message.

How to reproduce

Compile a (non-conforming) source file, a.saty, whose first character is the non-ASCII character あ, with satysfi (SATySFi version 0.0.6).

The error message then contains an invalid byte sequence instead of the offending character. (In the output below, the invalid byte is shown as the U+FFFD � replacement character; the actual output contains only the first byte of あ.)

$ satysfi -o /dev/null a.saty
 ---- ---- ---- ----
  target file: 'null'
  dump file: 'a.satysfi-aux' (will be created)
  parsing 'a.saty' ...
! [Syntax Error at Lexer] at "a.saty", line 1, characters 0-1:
    illegal token '�' in a program area

Related Issues

#312 reports an issue with the error position, but the root cause is the same: the lexer's error handling is not Unicode-aware.

I think both this issue and #312 could be fixed by reimplementing the lexer with sedlex. The necessary modifications would conflict heavily with #294, though.
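
To illustrate what Unicode-aware error reporting involves, here is a rough Python sketch. It is independent of sedlex and is not how SATySFi (which is written in OCaml) implements anything. The helper names `utf8_sequence_length` and `illegal_token_at` are hypothetical, and the column calculation assumes that error positions should be counted in code points, which this thread does not state.

```python
from typing import Tuple


def utf8_sequence_length(lead_byte: int) -> int:
    """Length of the UTF-8 sequence announced by lead_byte (1 if not a valid lead)."""
    if lead_byte < 0x80:
        return 1
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    return 1  # continuation byte or invalid lead byte


def illegal_token_at(line: bytes, byte_offset: int) -> Tuple[str, int]:
    """Return the whole offending code point and its column in code points.

    `line` is assumed to be a single UTF-8 encoded source line and
    `byte_offset` the byte position where the lexer failed.
    """
    length = utf8_sequence_length(line[byte_offset])
    token = line[byte_offset:byte_offset + length].decode("utf-8", errors="replace")
    # Count code points, not bytes, before the offending token.
    column = len(line[:byte_offset].decode("utf-8", errors="replace"))
    return token, column


token, column = illegal_token_at("aあb".encode("utf-8"), 1)
print(repr(token), column)
```

With this approach the message would carry the complete character, and the reported column would not drift when earlier characters on the line are multi-byte.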

I have reimplemented the SATySFi lexer with sedlex [1].

Footnotes

  1. https://github.com/puripuri2100/satysfifmt/blob/master/src/frontend/lexer.ml

Awesome! That will be a great starting point to resolve this issue.