crystal-lang / crystal

The Crystal Programming Language

Home Page: https://crystal-lang.org

Crystal doesn't UTF-8-Validate first byte of input

BlobCodes opened this issue

Bug Report

The following code compiles fine, even though the macro generates invalid UTF-8:

{{ "\xFF = 2".id }}

This only slips through when the invalid byte is the first byte of the input. If the invalid byte is at any later position, an exception is raised:

{{ "\xFF\xFE = 2".id }}
# Unexpected byte 0xfe at position 1, malformed UTF-8 (InvalidByteSequenceError)
#   from /crystal/src/compiler/crystal/syntax/lexer.cr:2759:9 in '??'
#   from /crystal/src/compiler/crystal/syntax/lexer.cr:1057:11 in 'next_token'
#   from /crystal/src/enum.cr:361:3 in 'parse_macro_source'
#   from /crystal/src/compiler/crystal/semantic/semantic_visitor.cr:359:23 in 'expand_inline_macro'
#   from /crystal/src/compiler/crystal/semantic/semantic_visitor.cr:431:3 in 'accept'
#   from /crystal/src/enumerable.cr:510:7 in '??'
#   from /crystal/src/compiler/crystal/syntax/visitor.cr:27:12 in 'accept'
#   from /crystal/src/compiler/crystal/semantic.cr:70:7 in 'semantic:cleanup'
#   from /crystal/src/compiler/crystal/compiler.cr:201:14 in 'compile:combine_rpath'
#   from /crystal/src/compiler/crystal/compiler.cr:195:56 in 'compile:combine_rpath'
#   from /crystal/src/compiler/crystal/command/eval.cr:30:5 in 'eval'
#   from /crystal/src/compiler/crystal/command.cr:126:12 in 'run'
#   from /crystal/src/compiler/crystal.cr:11:1 in '__crystal_main'
#   from /crystal/src/crystal/main.cr:129:5 in 'main'
#   from src/env/__libc_start_main.c:95:2 in 'libc_start_main_stage2'
# Error: you've found a bug in the Crystal compiler. Please open an issue, including source code that will allow us to reproduce the bug: https://github.com/crystal-lang/crystal/issues

Oh, and this "you've found a bug in the Crystal compiler" message should probably also be fixed.


$ crystal -v

Crystal 1.12.1 [4cea10199] (2024-04-11)

LLVM: 15.0.7
Default target: x86_64-unknown-linux-gnu

> Oh, and this "you've found a bug in the Crystal compiler" message should probably also be fixed.

What fixing does it need?

I just meant that macros generating invalid UTF-8 shouldn't result in a "compiler bug" message because it's user error.
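
To illustrate the distinction, here is a minimal sketch of reporting invalid UTF-8 in generated source as a regular error rather than letting it surface as an internal failure. `SourceError` and `check_macro_output` are hypothetical names invented for this example and are not part of the compiler; the only standard library calls used are `String.new(Bytes)` and `String#valid_encoding?`.

```crystal
# Hypothetical error type standing in for a regular, user-facing compile error.
class SourceError < Exception
end

# Hypothetical helper: rejects generated source that is not valid UTF-8
# before it would be handed to a parser.
def check_macro_output(source : String) : String
  unless source.valid_encoding?
    raise SourceError.new("macro produced invalid UTF-8")
  end
  source
end

# The bytes of "\xFF = 2" from the report, built explicitly.
bad = String.new(Bytes[0xFF, 0x20, 0x3D, 0x20, 0x32])

begin
  check_macro_output(bad)
rescue ex : SourceError
  puts "Error: #{ex.message}"
end
```

Whether such a check belongs before parsing or inside the lexer is a design question for the compiler; the sketch only shows that the condition can be detected and reported as a user error.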

Note that the same error appears when the first byte of the source file is invalid UTF-8.

$ echo '\xFF' | bin/crystal eval
Using compiled compiler at .build/crystal
Regex match error: UTF-8 error: illegal byte (0xfe or 0xff) (ArgumentError)
  from src/regex/pcre2.cr:275:9 in 'match_data'
  from src/regex/pcre2.cr:207:18 in 'match_impl'
  from src/regex.cr:672:12 in 'match_at_byte_index'
  from src/regex.cr:621:12 in 'match:options'
  from src/string.cr:3227:13 in '=~'
  from src/compiler/crystal/semantic/suggestions.cr:41:25 in 'lookup_similar_def'
  from src/compiler/crystal/semantic/suggestions.cr:73:7 in 'lookup_similar_def_name'
  from src/compiler/crystal/semantic/call_error.cr:594:5 in 'raise_undefined_method'
  from src/compiler/crystal/semantic/call_error.cr:98:7 in 'raise_matches_not_found'
  from src/compiler/crystal/semantic/call.cr:291:9 in 'lookup_matches_in_type'
  from src/compiler/crystal/semantic/call.cr:254:3 in 'lookup_matches_in_type:search_in_parents:with_autocast'
  from src/compiler/crystal/semantic/call.cr:210:5 in 'lookup_matches_in'
  from src/compiler/crystal/semantic/call.cr:209:3 in 'lookup_matches_in:with_autocast'
  from src/compiler/crystal/semantic/call.cr:197:7 in 'lookup_matches_without_splat'
  from src/compiler/crystal/semantic/call.cr:124:17 in 'lookup_matches:with_autocast'
  from src/compiler/crystal/semantic/call.cr:113:5 in 'lookup_matches'
  from src/compiler/crystal/semantic/call.cr:90:15 in 'recalculate'
  from src/compiler/crystal/semantic/main_visitor.cr:1380:7 in 'recalculate_call'
  from src/compiler/crystal/semantic/main_visitor.cr:1359:7 in 'visit'
  from src/compiler/crystal/syntax/visitor.cr:27:12 in 'accept'
  from src/compiler/crystal/semantic/main_visitor.cr:688:11 in 'visit'
  from src/compiler/crystal/syntax/visitor.cr:27:12 in 'accept'
  from src/compiler/crystal/semantic/main_visitor.cr:6:7 in 'visit_main:process_finished_hooks:cleanup:visitor'
  from src/compiler/crystal/progress_tracker.cr:22:7 in 'semantic:cleanup'
  from src/compiler/crystal/compiler.cr:219:14 in 'compile:combine_rpath'
  from src/compiler/crystal/compiler.cr:213:56 in 'compile:combine_rpath'
  from src/compiler/crystal/command/eval.cr:29:5 in 'eval'
  from src/compiler/crystal/command.cr:101:7 in 'run'
  from src/compiler/crystal/command.cr:55:5 in 'run'
  from src/compiler/crystal/command.cr:54:3 in 'run'
  from src/compiler/crystal.cr:11:1 in '__crystal_main'
  from src/crystal/main.cr:129:5 in 'main_user_code'
  from src/crystal/main.cr:115:7 in 'main'
  from src/crystal/main.cr:141:3 in 'main'
  from /lib/x86_64-linux-gnu/libc.so.6 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
  from /home/johannes/src/crystal-lang/crystal/.build/crystal in '_start'
  from ???
Error: you've found a bug in the Crystal compiler. Please open an issue, including source code that will allow us to reproduce the bug: https://github.com/crystal-lang/crystal/issues

It's very peculiar that the compiler gets as far as a regex match while looking for similar names before it notices that something is wrong.

If the invalid byte is at any later position, the compiler errors gracefully:

$ echo '-\xFF' | crystal eval
Error: file 'eval' is not a valid Crystal source file: Unexpected byte 0xff at position 1, malformed UTF-8

The expected error is only raised inside Crystal::Lexer#next_char_no_column_increment, after a call to Char::Reader#next_char; the same check needs to happen in Crystal::Lexer#initialize as well.
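
The shape of the problem can be sketched with a miniature lexer. `TinyLexer`, its `check_first_char` flag, and `check_encoding!` are hypothetical and only mirror the structure described above; this is not the code in lexer.cr. The sketch relies on `Char::Reader` decoding the first character when it is constructed and exposing any decoding failure through `Char::Reader#error`.

```crystal
# Hypothetical miniature lexer; it mirrors the shape described above
# and is not the compiler's actual implementation.
class TinyLexer
  def initialize(source : String, check_first_char : Bool)
    @reader = Char::Reader.new(source)
    # Char::Reader decodes the first character on construction, so a decoding
    # error can already be pending here, before any call to next_char.
    check_encoding! if check_first_char
  end

  def next_char : Char
    char = @reader.next_char
    check_encoding!
    char
  end

  private def check_encoding!
    if error = @reader.error
      raise ArgumentError.new(
        "Unexpected byte 0x#{error.to_s(16)} at position #{@reader.pos}, malformed UTF-8"
      )
    end
  end
end

# The bytes of "\xFF = 2": only the first byte is invalid.
source = String.new(Bytes[0xFF, 0x20, 0x3D, 0x20, 0x32])

# Without the check in initialize, the invalid first byte goes unnoticed.
lenient = TinyLexer.new(source, check_first_char: false)
lenient.next_char

# With the check in initialize, the same input is rejected immediately.
begin
  TinyLexer.new(source, check_first_char: true)
rescue ex : ArgumentError
  puts ex.message
end
```

With `check_first_char: false`, an input whose second byte is also invalid would still fail on the first `next_char` call, which matches the behaviour reported for `"\xFF\xFE = 2"`.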