GaloisInc / saw-script

The SAW scripting language.

Support MIR string slices

RyanGlScott opened this issue

Currently, the SAW MIR backend supports creating slices of type &[T], but it does not support creating a slice of type &str. In principle, however, SAW could support the latter without too much additional effort. Here is a sketch of how this would work in SAW:

  • We'd need mir_str_slice_value : MIRValue -> MIRValue and mir_str_slice_range_value : MIRValue -> Int -> Int -> MIRValue functions in SAWScript. Each of these functions would take a MIRValue argument representing a reference to an array of bytes (i.e., u8s) and return a MIRValue representing a &str slice. The range variant would additionally take the start and end indices of the subrange to slice.

  • Why an array of bytes? This is a consequence of how crucible-mir desugars &str slices. Specifically, crucible-mir represents a &str slice as a reference to a UTF-8–encoded sequence of bytes. Therefore, taking an array of bytes is the most natural way to interface with crucible-mir.

  • How are users expected to create these arrays of bytes? Most of the time, Cryptol's string literals will be a natural tool for the job. If you write something like "abc" in Cryptol, it will desugar to a sequence of bytes, so "abc" : [3][8]. As luck would have it, that is exactly what we need for crucible-mir.

  • This approach isn't perfect, since Cryptol isn't yet capable of handling Unicode string literals (see this issue). For instance, the expression "roșu" simply doesn't typecheck in Cryptol, since the character 'ș' would require 10 bits to represent instead of 8. Instead, you would have to manually encode the string into UTF-8 and write "ro\200\153u", where "\200\153" is the UTF-8 encoding of the character 'ș'. If we want to do better, we'd need to change Cryptol first.
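
To make the byte-level picture concrete, here is a minimal Rust sketch (plain Rust, not SAW or crucible-mir code) that prints the UTF-8 bytes behind the two literals discussed above. The helper names utf8_bytes and code_point_bits are hypothetical and exist only for this illustration.

```rust
/// Returns the UTF-8 bytes of a string, i.e., the byte sequence that a
/// `&str` is a view over. (Hypothetical helper for illustration.)
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

/// Number of bits needed to represent a character's Unicode code point.
/// (Hypothetical helper for illustration.)
fn code_point_bits(c: char) -> u32 {
    32 - (c as u32).leading_zeros()
}

fn main() {
    // ASCII-only literal: one byte per character, matching Cryptol's [3][8].
    println!("{:?}", utf8_bytes("abc"));

    // 'ș' is U+0219, whose code point does not fit in 8 bits.
    println!("{}", code_point_bits('ș'));

    // Encoded as UTF-8, 'ș' becomes the two bytes 200 and 153, which is
    // what the Cryptol workaround "ro\200\153u" spells out by hand.
    println!("{:?}", utf8_bytes("roșu"));
}
```

The five bytes of "roșu" are what a [5][8] Cryptol sequence would need to contain for the resulting &str to match the Rust literal.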

Another question we will have to answer: should we emulate Rust's ability to check if a string slice over a subrange uses invalid indexing? Take this example from The Rust Programming Language:

let hello = "Здравствуйте";
let s = &hello[0..1];

This will panic at runtime because the Cyrillic letter З requires two bytes in UTF-8, but the range [0..1] only accesses the first of these two bytes, resulting in a slice that doesn't end at a char boundary.
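
The same boundary check can be observed without a panic by using the standard library's str::get and str::is_char_boundary. The following is a small sketch; checked_slice is a hypothetical wrapper introduced only for this example.

```rust
/// Slices `s` by byte range, returning `None` instead of panicking when
/// the range is out of bounds or an endpoint is not on a char boundary.
/// (Hypothetical wrapper around the standard `str::get`.)
fn checked_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let hello = "Здравствуйте";

    // Byte index 1 falls in the middle of the two-byte encoding of 'З'.
    println!("{}", hello.is_char_boundary(1));

    // The range 0..1 is rejected, whereas 0..2 covers 'З' exactly.
    println!("{:?}", checked_slice(hello, 0, 1));
    println!("{:?}", checked_slice(hello, 0, 2));
}
```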

The question: Should mir_str_slice_range_value str_ref 0 1 perform the same check on str_ref's underlying string? Doing so would require using the crucible-mir memory model to emulate the behavior of the is_char_boundary function. Not impossible to do, but likely a bit fiddly. If we opt not to do this, then we should advertise this shortcoming in the SAW manual.
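
For reference, the property that such a check would have to establish can be stated purely in terms of the underlying bytes. Below is a minimal sketch in plain Rust, assuming the byte array already holds valid UTF-8. The names is_char_boundary_bytes and range_is_valid are hypothetical, and this is not how SAW or crucible-mir implements anything; it only illustrates the condition that would need to be encoded.

```rust
/// Byte-level version of the char-boundary test, assuming `bytes` is valid
/// UTF-8. An index is a boundary if it is 0, equal to the length, or points
/// at a byte that is not a UTF-8 continuation byte (bit pattern 10xxxxxx).
/// (Hypothetical helper for illustration.)
fn is_char_boundary_bytes(bytes: &[u8], index: usize) -> bool {
    if index == 0 || index == bytes.len() {
        return true;
    }
    match bytes.get(index) {
        // Past the end of the array: never a valid boundary.
        None => false,
        // Continuation bytes have their top two bits set to 10.
        Some(&b) => (b & 0b1100_0000) != 0b1000_0000,
    }
}

/// A subrange is a valid `&str` slice only if it is well ordered and both
/// endpoints are char boundaries. (Hypothetical helper for illustration.)
fn range_is_valid(bytes: &[u8], start: usize, end: usize) -> bool {
    start <= end
        && is_char_boundary_bytes(bytes, start)
        && is_char_boundary_bytes(bytes, end)
}

fn main() {
    let bytes = "Здравствуйте".as_bytes();
    println!("{}", range_is_valid(bytes, 0, 1));
    println!("{}", range_is_valid(bytes, 0, 2));
}
```

On concrete byte arrays this reduces to a pair of bit tests. The fiddly part mentioned above would be expressing the same tests over the bytes that str_ref points to in the crucible-mir memory model.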