Crystal doesn't UTF-8-Validate first byte of input
BlobCodes opened this issue
Bug Report
The following code compiles fine, even though the macro generates invalid UTF-8:
{{ "\xFF = 2".id }}
The invalid byte only goes undetected when it is the first byte of the input. If any later byte is invalid, an exception is raised:
{{ "\xFF\xFE = 2".id }}
# Unexpected byte 0xfe at position 1, malformed UTF-8 (InvalidByteSequenceError)
# from /crystal/src/compiler/crystal/syntax/lexer.cr:2759:9 in '??'
# from /crystal/src/compiler/crystal/syntax/lexer.cr:1057:11 in 'next_token'
# from /crystal/src/enum.cr:361:3 in 'parse_macro_source'
# from /crystal/src/compiler/crystal/semantic/semantic_visitor.cr:359:23 in 'expand_inline_macro'
# from /crystal/src/compiler/crystal/semantic/semantic_visitor.cr:431:3 in 'accept'
# from /crystal/src/enumerable.cr:510:7 in '??'
# from /crystal/src/compiler/crystal/syntax/visitor.cr:27:12 in 'accept'
# from /crystal/src/compiler/crystal/semantic.cr:70:7 in 'semantic:cleanup'
# from /crystal/src/compiler/crystal/compiler.cr:201:14 in 'compile:combine_rpath'
# from /crystal/src/compiler/crystal/compiler.cr:195:56 in 'compile:combine_rpath'
# from /crystal/src/compiler/crystal/command/eval.cr:30:5 in 'eval'
# from /crystal/src/compiler/crystal/command.cr:126:12 in 'run'
# from /crystal/src/compiler/crystal.cr:11:1 in '__crystal_main'
# from /crystal/src/crystal/main.cr:129:5 in 'main'
# from src/env/__libc_start_main.c:95:2 in 'libc_start_main_stage2'
# Error: you've found a bug in the Crystal compiler. Please open an issue, including source code that will allow us to reproduce the bug: https://github.com/crystal-lang/crystal/issues
Oh, and this "you've found a bug in the Crystal compiler" message should probably also be fixed.
$ crystal -v
Crystal 1.12.1 [4cea10199] (2024-04-11)
LLVM: 15.0.7
Default target: x86_64-unknown-linux-gnu
> Oh, and this "you've found a bug in the Crystal compiler" message should probably also be fixed.
What fixing does it need?
I just meant that macros generating invalid UTF-8 shouldn't result in a "compiler bug" message because it's user error.
Note that the same error appears when the first byte of the source file is invalid UTF-8.
$ echo '\xFF' | bin/crystal eval
Using compiled compiler at .build/crystal
Regex match error: UTF-8 error: illegal byte (0xfe or 0xff) (ArgumentError)
from src/regex/pcre2.cr:275:9 in 'match_data'
from src/regex/pcre2.cr:207:18 in 'match_impl'
from src/regex.cr:672:12 in 'match_at_byte_index'
from src/regex.cr:621:12 in 'match:options'
from src/string.cr:3227:13 in '=~'
from src/compiler/crystal/semantic/suggestions.cr:41:25 in 'lookup_similar_def'
from src/compiler/crystal/semantic/suggestions.cr:73:7 in 'lookup_similar_def_name'
from src/compiler/crystal/semantic/call_error.cr:594:5 in 'raise_undefined_method'
from src/compiler/crystal/semantic/call_error.cr:98:7 in 'raise_matches_not_found'
from src/compiler/crystal/semantic/call.cr:291:9 in 'lookup_matches_in_type'
from src/compiler/crystal/semantic/call.cr:254:3 in 'lookup_matches_in_type:search_in_parents:with_autocast'
from src/compiler/crystal/semantic/call.cr:210:5 in 'lookup_matches_in'
from src/compiler/crystal/semantic/call.cr:209:3 in 'lookup_matches_in:with_autocast'
from src/compiler/crystal/semantic/call.cr:197:7 in 'lookup_matches_without_splat'
from src/compiler/crystal/semantic/call.cr:124:17 in 'lookup_matches:with_autocast'
from src/compiler/crystal/semantic/call.cr:113:5 in 'lookup_matches'
from src/compiler/crystal/semantic/call.cr:90:15 in 'recalculate'
from src/compiler/crystal/semantic/main_visitor.cr:1380:7 in 'recalculate_call'
from src/compiler/crystal/semantic/main_visitor.cr:1359:7 in 'visit'
from src/compiler/crystal/syntax/visitor.cr:27:12 in 'accept'
from src/compiler/crystal/semantic/main_visitor.cr:688:11 in 'visit'
from src/compiler/crystal/syntax/visitor.cr:27:12 in 'accept'
from src/compiler/crystal/semantic/main_visitor.cr:6:7 in 'visit_main:process_finished_hooks:cleanup:visitor'
from src/compiler/crystal/progress_tracker.cr:22:7 in 'semantic:cleanup'
from src/compiler/crystal/compiler.cr:219:14 in 'compile:combine_rpath'
from src/compiler/crystal/compiler.cr:213:56 in 'compile:combine_rpath'
from src/compiler/crystal/command/eval.cr:29:5 in 'eval'
from src/compiler/crystal/command.cr:101:7 in 'run'
from src/compiler/crystal/command.cr:55:5 in 'run'
from src/compiler/crystal/command.cr:54:3 in 'run'
from src/compiler/crystal.cr:11:1 in '__crystal_main'
from src/crystal/main.cr:129:5 in 'main_user_code'
from src/crystal/main.cr:115:7 in 'main'
from src/crystal/main.cr:141:3 in 'main'
from /lib/x86_64-linux-gnu/libc.so.6 in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
from /home/johannes/src/crystal-lang/crystal/.build/crystal in '_start'
from ???
Error: you've found a bug in the Crystal compiler. Please open an issue, including source code that will allow us to reproduce the bug: https://github.com/crystal-lang/crystal/issues
It's very peculiar that the compiler gets as far as a regex match looking for similar names before it notices that something is wrong.
If an invalid encoding is in any later byte, the compiler errors gracefully:
$ echo '-\xFF' | crystal eval
Error: file 'eval' is not a valid Crystal source file: Unexpected byte 0xff at position 1, malformed UTF-8
The expected error is only raised inside Crystal::Lexer#next_char_no_column_increment after a call to Char::Reader#next_char; this needs to be done in Crystal::Lexer#initialize as well.
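To illustrate the idea outside the compiler, here is a minimal sketch. It relies on Char::Reader decoding the first character in its constructor and exposing any decoding problem through Char::Reader#error, so the first byte can be checked before any call to next_char. The helper name check_first_char is hypothetical and not part of the compiler; the message format is copied from the error shown above, and the real fix in Crystal::Lexer#initialize may well look different.

```crystal
# Hypothetical helper (not compiler code): reports an invalid byte sequence
# at the very start of `source`, without calling `next_char` first.
def check_first_char(source : String) : String?
  reader = Char::Reader.new(source)

  # `Char::Reader#error` returns the offending byte if the current char
  # could not be decoded, or nil if it decoded fine.
  if error = reader.error
    "Unexpected byte 0x#{error.to_s(16)} at position #{reader.pos}, malformed UTF-8"
  end
end

# An invalid first byte yields a message, valid input yields nil.
puts check_first_char("\xFF = 2").inspect
puts check_first_char("x = 2").inspect
```

Inside the lexer, the same check would presumably raise InvalidByteSequenceError instead of returning a string, so that an invalid first byte is reported the same way as an invalid byte at any later position.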