ballercat / walt

:zap: Walt is a JavaScript-like syntax for WebAssembly text format :zap:

Home Page: https://ballercat.github.io/walt/


Discussion: Strings

JobLeonard opened this issue · comments

Feature Request (more of a discussion really)

Overview

When thinking of syntactic sugar for chars (#103) I started thinking of ways to implement strings. Also, @ballercat mentioned in this comment:

I do have plans for at least static strings, so I would like to keep string tokenizing around.

I'm curious what those plans are. There are quite a few design choices to make regarding what kind of strings Walt wants to support, so maybe a discussion is good?

I'll write down my thoughts below, perhaps they're of use to start the discussion.

Impact

Extra-Large.

The consequences of settling on a convention for strings are huge, since it determines what is and isn't idiomatic string-handling code in Walt. It also determines the most common way to interop with JS strings, so this is not a decision to be made lightly, IMO.

Details

So the seed for these ideas came from #103, which proposed adding syntactic sugar for turning single characters into the equivalent i32 value that calling codePointAt(0) would return. This would simplify interacting with string data passed to Walt code (since that also would have to be turned into numerical data first).

Strings could be handled similarly: as i32[] arrays packing either full code point values, or two UTF16 values at a time. (I'm skipping the option of UTF8 encoding for now, as that would require converting JavaScript's UTF16 code points to UTF8 and back, which would probably make things prohibitively slow. Also, UTF16 keeps strings consistent with JavaScript strings, which will likely ease interop. However, see "Syntax" below.)

Of course, a plain array is probably a bit too simplistic, because anyone using strings will probably want to know their length too. A few simple options for handling that:

  • encode them like null-terminated strings in C, with a zero at the end. Of course, there are plenty of issues with that, and we can spare some bytes for overhead
  • use the first i32 element of the array to represent the length, and the rest of the array as the string
  • do both: length at the start, zero-termination at the end. Why not? Benefit of both worlds at minimal overhead costs.

I think the last option is probably best: if we encode strings as packed UTF16, it's only six bytes of overhead. If stored as full i32 elements for each character, it's eight. Nobody will miss that.

Using a single i32 for string length would limit single strings to length 0x7FFFFFFF, but seriously... that's the equivalent of 2 GiB worth of characters, assuming one byte per character. So that's two or four times that size in WASM. One could interpret the length as unsigned (length >>> 0 in JS terms) to get the full 0xFFFFFFFF range of the 32 bits, but if you need more than 2 billion characters, I doubt 4 billion will suffice either.

(one could also use the MSB as some kind of flag, meaning negative values for the length could be meaningful, but I'm struggling to think of what kind of flag one might need for strings exactly. Maybe a rope-structure where the string isn't zero-terminated but ends with the pointer to the next piece of rope?)
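For reference, reading a two's-complement i32 length back as unsigned needs no bit gymnastics on the JS side; a small sketch:

```javascript
// A negative signed i32 length, reinterpreted as unsigned with `>>> 0`.
const len = -2 | 0;            // bit pattern 0xFFFFFFFE as a signed i32
const unsigned = len >>> 0;    // 4294967294, the full-range interpretation
```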

Then the choice would be how to store individual chars:

  • use codePointAt to generate full unicode code points
  • use charCodeAt to generate UTF16 characters, store one per array element
  • use charCodeAt to generate UTF16 characters, store two per i32 array element

Personally I'd probably want to use the two-UTF16-chars-per-element, since it's half the memory overhead for characters. In my experience, when it comes to performance, avoiding the memory wall outweighs a little bit of bitshifting and masking.
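The bit math involved is modest on the JS side too; a sketch of packing and unpacking two UTF-16 code units per i32 (low code unit first, an arbitrary choice here):

```javascript
// Pack two UTF-16 code units of a string into one i32, low half first.
const s = 'ab';
const packed = s.charCodeAt(0) | (s.charCodeAt(1) << 16);

// Unpack:
const first = packed & 0xFFFF;   // low 16 bits → 97, 'a'
const second = packed >>> 16;    // high 16 bits → 98, 'b'
```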

However, that would require special bitshifting code all over Walt whenever strings are manipulated, unless even more sugar is introduced, and then we're getting further and further removed from the WASM metal. So instead I think convenience & ease of writing correct code is probably a better default over error-prone optimisations that often are unnecessary (but again, see "syntax" below).

That leaves the two other options, of which codePointAt seems more logical, since there is little benefit left to using charCodes.

Extracting WASM strings would then require using the ArrayBuffer of the module to generate a Uint32Array, and then generating strings from the section of code representing a string via String.fromCodePoint. Sending would be the same process in reverse.

(It's probably a good idea to create convenience functions to facilitate these conversion steps.)
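A sketch of what such convenience functions could look like on the JS side, assuming the length-prefixed, full-code-point i32 layout discussed above (the function names and layout are hypothetical, not an existing Walt API):

```javascript
// Layout assumed: [length: i32][codePoint 0: i32][codePoint 1: i32]…

// Read a string back out of a module's ArrayBuffer.
function readWaltString(buffer, byteAddr) {
  const view = new Uint32Array(buffer);
  const base = byteAddr >>> 2;           // byte address → i32 index
  const length = view[base];
  return String.fromCodePoint(...view.subarray(base + 1, base + 1 + length));
}

// The same process in reverse: write a JS string into the buffer.
function writeWaltString(buffer, byteAddr, str) {
  const view = new Uint32Array(buffer);
  const base = byteAddr >>> 2;
  const codePoints = [...str].map(c => c.codePointAt(0));
  view[base] = codePoints.length;
  view.set(codePoints, base + 1);
}
```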

Generation

Similar to chars, strings should be easy to generate:

  • initiate the array size at the right length
  • iterate over the string with codePointAt and assign the values to the array (just like the conversion of a single char)

That's it.
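Those two steps amount to something like this (a hypothetical sketch of what the compiler could emit, expressed in JS):

```javascript
// Build a length-prefixed code-point array from a JS string.
// Spreading the string iterates code points, so surrogate pairs
// come out as single values.
function toCodePointArray(str) {
  const points = [...str].map(c => c.codePointAt(0));
  return [points.length, ...points];
}
```

For example, `toCodePointArray('ab')` yields `[2, 97, 98]`, and an astral character like 😀 stays a single element.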

Syntax

So you might think "huh? what syntax? Just use single and double quotes, no?", but I want to backtrack a bit: I argued that for the common case, using i32[] arrays filled with full unicode code points was probably the best idea, since it seems like it would be the least error-prone.

However, imagine you're dealing with a piece of text that only has ASCII characters; then that difference in performance and overhead becomes quite significant. Perhaps you'd like a string literal to turn into one that only uses the 8 bits per character required. So then why not use something like:

let utf8String: i32[] = 1;
// 64 chars, so 16 i32 elements
utf8String = utf8`This will turn into 8-bit chars packed as groups of four, y'all!`;

let utf16String: i32[] = 17;
// 51 utf16 chars, so 26 i32 elements
utf16String = utf16`Liberté, égalité, fraternité for all utf encodings!`;

let asciiString: i32[] = 43;
asciiString = c`For when you really need that null-terminated, 8-bit string`;

Basically, my idea is to use tagged templates to support different encodings, since I imagine that various contexts have better use for either UTF8, UTF16 or even UTF32 strings.

So the idea is to use single and double quotes for plain strings, whatever convention will end up being used for that, and tagged templates for these specific, but likely, use-cases. This way we can keep everyone happy.
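As a proof of concept, a utf8 tag is easy to prototype in plain JavaScript (the tag name and its return shape are illustrations only, not a proposed API):

```javascript
// A tag returning the UTF-8 byte values the compiler could then pack
// four-per-i32. TextEncoder handles the UTF-16 → UTF-8 conversion.
function utf8(strings) {
  return Array.from(new TextEncoder().encode(strings.raw.join('')));
}
```

So `utf8`hi!`` yields `[104, 105, 33]`, one byte per ASCII character, while multi-byte code points expand as UTF-8 dictates.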

Argh, accidentally hit the submit button! I'm still editing this comment, please wait with reading it!

Ok, ready.

Let me start off by saying thank you for taking the time to write all this down. I enjoyed reading about your perspective on the topic; it's given me some fresh ideas. For example, I always envisioned either a null-terminated string or length encoded version, but you're right we could do both. I did not even consider tagged template literals and the doors it opens up for different encodings!

Before I get into the things I agree with you on (everything?) let me reiterate what my thought process is with pretty much every feature. The ultimate goal with Walt is to open the full feature-set of WebAssembly to the developer via good old JavaScript. To maintain this, I will always shy away from overly opinionated syntax which may hamstring the developer from doing something I have not envisioned initially. What this results in is a bit more manual work for the end developer and fewer compiler magic tricks when possible.

Structure

A 32-bit length followed by the string body, with an additional null terminator, should work well. There is an argument to be made that the null terminator is unnecessary, but it might provide potential interop with C compiled to WASM. 32 bits of length is not an issue either, as the WASM MVP only exposes 32 bits of address space for memory anyway. For consistency's sake I could argue that the null terminator is unnecessary as well: all binary encoding in wasm uses a 32-bit length followed by UTF-8 encoded bytes for the string, here is an example. It seems like the null terminator will be a throwaway for most use-cases if we have the length encoded. Rust, which also uses WASM as a target, does not care about null characters in the string body (I don't know about Rust internals, but likely for the same reason).

Anyway, I'm going to make a case that introducing strings actually changes array semantics and that most string problems boil down to array problems.

codePointAt() vs charCodeAt()

I think that codePointAt as the default behavior is the most likely approach. More on how to handle the alternatives later.

Encoding

This is about to get interesting.

String literal in the final binary?

I have some ideas on how to represent the literals in the binary, but I'm not super happy with them, so I would be interested in some fresh perspective.

Let's take a simple string literal "Quick brown fox jumped over the lazy dog.". The literal itself cannot be encoded directly though; there is no such thing as a string in WASM natively. Okay, no big deal, these are just some bytes to store somewhere. So the compiler will encode them into the Data section of the binary, which results in some static non-zero bytes in default memory. The Data section seems to be designed for this sort of thing. We can now represent the literal as a 32-bit pointer to the starting address in memory.

Cool, what happens when you assign a literal to a variable?

Should you be able to assign to a variable directly? Like this

const myString: i32[] = "Quick brown fox jumped over the lazy dog.";

... or should you only be able to do the thing you did above, which picks an address and then writes the bytes (a copy) starting at that address?

let myString: i32[] = 1;
myString = "Quick brown fox jumped over the lazy dog.";

The latter feels safer. Since the literal is a real chunk of memory, in the former you can just go ahead and edit any character in it dynamically...

const myString: i32[] = "Quick brown fox jumped over the lazy dog.";
myString[1] = "X";
const myStringTwo: i32[] = "Quick brown fox jumped over the lazy dog."; // should be the same memory pointer

🤔 myStringTwo will not be what you assigned it to, since we just mutated that memory: instead of Quick it'll say Xuick. There are additional problems with encoding into the data section; we now need a magic global __DATA_LENGTH__ or something, since you probably don't want to accidentally edit this static data.

const myString: i32[] = "Quick brown fox jumped over the lazy dog.";
// I need another array for some generic data
const foo: i32[] = __DATA_LENGTH__ + 1;

Not looking good! I guess the compiler could abstract that away and prefix all addresses with __DATA_LENGTH__, but that creates some opaque magic. We can't ever edit these addresses from within Walt, even if we create a const ptr: i32[] = 0. We'd also need the magic constant exported opaquely for JS interop, writing into memory from the outside, etc. All of this combined undermines some of the key goals of the project: to be unsurprising while allowing the full feature-set of WebAssembly.

These are all just variations of existing array problems.

I think the following is the problem we really need to solve first: how should this be represented in WASM?

let myArray: i32[] = 1;
myArray = [1, 2, 3, 4, 5, 6, 7, 8, 9];

Once this works, assigning a literal will just compile down to basic array assignment. Arrays now are just views into memory, and they compile down to individual load/store opcodes. What we are describing so far when we talk about a string is a special array, but it's probably better long term to align arrays and strings as much as possible.

  • The Data section is mixed into the linear (dynamic) memory
  • WASM has no mass move or copy memory instructions. I actually think this is one of the biggest problems in WebAssembly currently, but whatever.

The only thing I can think of which makes sense is to encode these assignments into a block of mem_store instructions, one for each index in the array literal. Which is not great, but gets around all the annoying memory issues I described above.

This also means that arrays should be encoded with a length as well. Right now they are simple views into memory, but passing an array around results in a loss of this length information. I'm not entirely sure what that would look like in syntax. Maybe a difference between const i32[] vs let i32[]? For example, mutable arrays (let) could be encoded as a 64-bit value with the pointer and length packed right into the local instead of linear memory.

I kind of like that idea; the syntax above would compile to this type of pseudo-code:

let __localBase = 1;
let local = (length << 32) | __localBase;
mem_store(__localBase + 0, 1);
mem_store(__localBase + 1, 2);
// etc. until length

There are obvious drawbacks to this approach, like the fact that mutable arrays must be re-assigned if you want them to be changed by a different function. Or that exporting them to JavaScript means exploding the array pointer into two 32-bit arguments. The wins are that indexing is the same (zero-indexed), that mutating the array is a single memory instruction, and that adjusting the size is bit math only. Indexing into these arrays would be a single 32-bit truncate call, which isn't bad. Wrapping one of these in a struct, or putting it directly into memory, can get around some limitations of having to pass these references around.
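The packed 64-bit local can be modeled with BigInt (plain JS numbers cannot represent all 64-bit patterns exactly); a sketch, with the length in the high half as described above:

```javascript
// Pack a 32-bit pointer and a length into one 64-bit value.
const ptr = 1n;
const len = 9n;
const local = (len << 32n) | ptr;

// Indexing needs only the low 32 bits (the "32-bit truncate"):
const base = local & 0xFFFFFFFFn;

// Adjusting the size is bit math on the high half only:
const grown = ((len + 1n) << 32n) | base;
```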

On Template Literals and custom encodings

This took forever to write down, so I'm going to stop here for now, but I do enjoy the conversation. I like the idea of template literals! As for UTF-8/UTF-16 strings, WASM includes 8-bit and 16-bit load/store instructions, so it would be trivial to encode these 👍 we would probably need to extend array types to u8[]/u16[] etc.

Let's keep the conversation going, there should be a pretty clear roadmap to implementing these once we can decide on an approach.

Syntax

I wouldn't use tagged templates for the syntax; JavaScript developers will tend to expect their features: substitution, multi-line, escapes, etc.

I would rather use function calls, like in Go for example:

a := string("a")
a = string(rune(97))

var b string

or as type annotation.

WASM has no mass move or copy memory instructions. I actually think this is one of the biggest problems in WebAssembly currently, but whatever.

I believe it will; bulk memory operations have a proposal, IIRC. You can probably implement it yourself before it's standard.

Representation

I would definitely go for the vec(bytes) (a length + the bytes), which is the standard for representing strings in the binary (module names, func names, etc.) and fits in the data section (a lot of compilers do it).

The locals idea is interesting; the binary size will be bigger, but it could be more efficient at runtime, I guess. One drawback is that you would have to duplicate strings.

Liberté, égalité, fraternité for all utf encodings!

😄 that's a good example.


The discussion is already pretty straightforward 👍

Skippable side-track

ballercat (bc): Before I get into the things I agree with you on(everything?) let me reiterate what my thought process is with pretty much every feature.

I read "disagree", and was actually looking forward to it! You must have given this a lot of thought, so a complete opposite perspective followed by a good discussion would surely lead to a better shared solution in the end. Then I read the rest and wondered "where is the respectful, well-argued disagreement? :(" and had to scroll up again. Being on the same page is a nice compliment too, though ;). (I still think well-argued dissenting perspectives are healthy for the discussion.)

Also, remember that I am a JavaScript programmer with no previous hands-on WASM experience, exploring it through Walt! I'm not fully up to speed on all the technical jargon and machinery you mentioned, and I'll look it up (since I want to learn WASM anyway), but let's try to use that naivety for the better as much as we can (like finding the blind spots you develop when you know the material too well).

bc: I will always shy away from overly opinionated syntax which may hamstring the developer from doing something I have not envisioned initially. What this results in is a bit more manual work for end developer and fewer compiler magic tricks when possible.

FWIW, I noticed that, and it's why Walt immediately struck a chord with me.

bc: WASM has no mass move or copy memory instructions

Are you serious? Why?! Is it a safety issue or something?

xtuc: The discussion is already pretty straightforward 👍

Dammit man, you had one job! Stop agreeing! 😜 But I hope you keep following at least, an extra set of eyes is always good!

The Array problem

bc: Anyway, I'm going to make a case that introducing strings actually changes array semantics and that most string problems boil down to array problems.

Get the fundamental array question right, and everything that builds on it gets easier. Makes sense.

Ideally for strings, there is a sweet spot that lets us show that a string is just an array, revealing the low-level machinery, while still keeping convenience, readability and most importantly safety through making doing the right thing easy.

Should you be able to assign to a variable directly?

Actually, I initially found the address assignment very confusing. Did it stand for initial array length? Memory location? I guess it does make sense in that you can assign another number to it, and it then represents a new address, and that "pointers" in WASM are really just i32 indices.

Still, let's explore whether alternate syntax would be helpful here, or some syntactic sugar with completely predictable desugaring logic. The idea is that the lines below are syntactic sugar for:

    let myString: i32[] = 1;
    myString = "Quick brown fox jumped over the lazy dog.";

I am actually not really happy with any of the following ideas, but it gets the concept across. Listing my bad ideas might inspire others to come up with a good variant:

// @ stands for "at", so perhaps it is a good "memory address" symbol
// The following reads as "i32 array at 1" - I find the []@ awkward though
const myString: i32[]@1 = "Quick brown fox jumped over the lazy dog.";

// Switching to Go-style order would break backwards compatibility,
// but would avoid the []@. It also reads in English-language order:
// "myString is an array of i32s at memory position 1"
const myString: []i32@1 = "Quick brown fox jumped over the lazy dog.";

// Feels less awkward than []@ but I'm not feeling this one.
// Can't really explain why though - getting a bit "bikesheddy"
const myString: i32[]:1 = "Quick brown fox jumped over the lazy dog.";

// I kinda like this one actually: it feels typographically nicer,
// but using / this way has no programming precedent :/
// closest analogy I can think of is directory structures
const myString: i32[]/1 = "Quick brown fox jumped over the lazy dog.";

// Overload curly braces even more, probably a bad idea
const myString: i32[]{1} = "Quick brown fox jumped over the lazy dog.";

// Hmm… maybe? Kinda matches Memory< … > syntax
const myString: i32[]<1> = "Quick brown fox jumped over the lazy dog.";

Another thing with arrays: I find it a bit odd not to have fixed length as an option for the type. There is no bounds checking of course, but one could have some type checking when passing parameters, or when assigning one array address to another.

The other option is that lengths could be used to make initialising multiple arrays in one block easier:

// Initiates five arrays of various sizes right behind each other
// I know Walt doesn't have a comma-operator (yet).
// Point is, imagine changing the length of `a`, and having to manually
// move all of these addresses... This kind of notation would let the
// compiler handle it for you.
const a: i32[197]<0>, b: i32[873], c: i32[416], d: i32[741], e: i32[251];

I will need to read a bit more on the rest of the workings of WASM to give feedback on the other things you discussed.

Function and/or Tagged Template notation

xtuc: I would rather use function calls, like in Go

Technically, tagged template literals are functions under the hood. So the following function for single chars can be used both ways in JavaScript:

/**
 * Converts a single-char string into a unicode code point.
 * To be used inside Walt.
 * @param {string | string[]} ch
 */
function char(ch){
  // This probably needs better checks that throw at
  // compile-time to avoid bad inputs coming through.
  // Whether `ch` is of type `string` or `string[]`,
  // ch[0] returns the first character.
  if (
    ch &&
    ch.length === 1 && // ensure only 1 argument in case of backtick notation
    (ch = ch[0]).length === 1 && // if str, ch[0] still works, so this is safe
    typeof ch === 'string'){
    return ch.codePointAt(0);
  } else {
    throw "Not a string of length 1";
  }
}

// both work
let a = char`a`;   // 97
let b = char('b'); // 98
let c = char``;    // throws: Not a string of length 1

The idea so far seems to be that tagged templates that represent built-ins will just be evaluated, and their returned values used as a replacement - a kind of "pseudo-macro". In which case both of these would work:

// if `char` is treated as a built-in, and evaluated at compile time, this just works:
let a: i32 = char`a`;
let b: i32 = char('b');

Whether or not Walt should support both ways of doing the same thing is a different question. I am not in favour, TBH - having more than one way to do the same thing will lead to confusion.

I prefer tagged templates because:

  • they communicate directly that this is a string (backticks) of a certain type (tag)
  • functions imply runtime function calls, which is not true for static strings
  • we may be able to keep all the benefits of backtick strings! (see below)

xtuc: I wouldn't use the tagged template for the syntax, JavaScript will tend to use its features: replacement, multi-line, escapes, etc.

Going with the "macro" approach should mean that almost everything you mention is implicitly supported through JavaScript, without any extra work for Walt (except having to parse multiline backtick strings, which is a convenience worth it in and of itself, IMO). The only exception is expressions:

// Multilines just include the newline, no work required by Walt
let multiline = `x
y`;
console.log([
  multiline.codePointAt(0), // 120, 'x'
  multiline.codePointAt(1), // 10, '\n'
  multiline.codePointAt(2)  // 121, 'y'
]);
// '45', this particular case poses no problems
let sum = `${1+2+3+4+5+6+7+8+9}`;
// Are `a` and `b` defined? If run as-is, this would be referring
// to variables existing during the *compiler* runtime, not *Walt* run-time
let expr = `${a} ${b}`;

Now, one simple way to handle this is making ${} forbidden with a regexp: if (node.search(/\${[^\$]+}/) !== -1) throw … (with node being the raw string of the AST node for a string, assuming that's how it is stored initially - I haven't checked the compiler code yet). This would be a safe starting point regardless of future plans: it is the least work, and it does not close the door on supporting it later.
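A testable sketch of that guard (the function name is hypothetical, and Walt's actual AST shape may differ):

```javascript
// Reject template literals containing ${…} expressions for now.
function assertNoInterpolation(rawSource) {
  if (/\$\{[^}]*\}/.test(rawSource)) {
    throw new Error('Walt does not support template expressions ${...} (yet)');
  }
  return rawSource;
}
```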

Support would require two "versions" of each tagged template (or function): one for compile time, which is the relatively simple JavaScript implementation with no expressions; the other for runtime usage, which would get quite complicated, because it can have a variable number of arguments and would also ideally support strings (of each type) and chars. And then there is of course the desire to recognise when all expressions are compile-time constants, to turn it into a static string after all, and… oi… Also, this is the kind of high-level string programming where WASM probably sucks compared to plain JavaScript, TBH.

Let's just start with the no-expression version, ${} can always be introduced later after more deliberation on how to handle it. It only takes one human-friendly compiler error to learn that these are not supported in Walt (yet).

At that point, I don't see a strong advantage of the function notation over the tagged templates. But that's quite subjective of course.

String and char concatenation

Still, this does bring up an issue related to string interpolation: programmers probably expect to be able to concatenate strings, and strings and chars. Personally, I kind of feel like if you want to do string manipulation in WASM, you probably want to do low-level stuff with it anyway, so how much convenience should be added? And in which form?

// Ignoring the encoding issue for now
let a: i32[] = 0;
a = 'Hello, World';
let b = char`!`;
let c: i32[] = 12;

// What should we do here? `b` has no type information about
// being a char, so are *all* i32 values promoted to strings?
c = a + b;

// Also, concatenating a lot of strings will be slow, using redundant copying
// Perhaps there should be an in-place `append` function?
let x: i32[] = 25;
x = "Appending multiple strings ";
let y: i32[] = 52; // having to manually calculate this is getting a bit annoying
y = "without intermediate allocations ";
let z: i32[] = 85;
z = "would be nice, don't you think?";
append(x, y);
append(x, z);


// a = 'Hello, World!'
append(a, b);


// option one: should give a compile-time error,
// since the char is not an array, there is no
// location to store the result
// option two: b = 'ello, World!H'
append(b, a);
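To make the intended in-place append semantics concrete, here is how it could behave on length-prefixed arrays, modeled in plain JS (the real thing would of course be load/store opcodes over linear memory, and the names here are hypothetical):

```javascript
// arr[0] is the length; elements follow. dst is assumed to have
// enough capacity behind it — no bounds checking, just like raw memory.
function append(dst, src) {
  const dstLen = dst[0];
  const srcLen = src[0];
  for (let i = 0; i < srcLen; i++) {
    dst[1 + dstLen + i] = src[1 + i];
  }
  dst[0] = dstLen + srcLen; // only the length prefix changes in dst
  return dst;
}
```

So appending `[1, 33]` (the char '!') to `[2, 97, 98]` ('ab') produces `[3, 97, 98, 33]`, with no intermediate copy of dst.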

Hi there.

Okay, I've spent the better part of the week racking my brain on how to move forward with supporting some version of strings for Walt. The discussion here was very helpful in identifying the current issues and potential future pitfalls.

The tentative plan going forward, which I'm willing to adopt, is a conservative approach based on established practices for handling strings in existing low-level languages (C) and on the underlying representations of strings in high-level dynamic languages (PHP).

Key high-level points

  • There will be no built-in string type
    • Patterns and conventions will be used over special semantics and syntax. Though some new syntax sugar will be necessary (like string literals and template literals)
    • This also means there will be no special handling of strings (like concatenation semantics)
  • There will be no changes to i32[] array representation
    • Having thought about the "array problem" some more, I don't think it's the direction I'd like to move forward with. I'll go into more details on this below.
    • The reason for my backtracking is the realization that current Object/Struct syntax in Walt allows for these patterns already. Albeit not without some more changes and improvements.
  • There will be a Data section which will enable
    • Array Literals
    • String Literals
  • Array and String Literals will be encoded as vectors with a leading i32 length followed by i32[] data

Arrays

This is really a tough one. I thought about adding more sugar and runtime help for using arrays in Walt. But I keep coming back to the fact that this isn't the goal of the project at all. Rewriting JavaScript or creating a new dynamic language within WebAssembly is a rabbit hole which has to be avoided.

Ultimately I don't see a path forward without making some compromises on the issues I raised in my earlier comments, like encoding strings into the data section, not providing additional string types, and not changing array semantics. I just don't see how Walt could reach a level of flexibility anywhere close to strings in JavaScript; it would always be a lesser implementation due to the constraints of WASM. And this should be fine, that's not what the project is about.

Support flexible (low-level) patterns over clever syntax

The idea for encoding strings/arrays as I'm about to describe them is really not new. My favorite example is PHP 7's new string object. A small annoyance the PHP implementation runs into is that there is no direct path from a C string literal to a zend_string, but we will have no such limitation, as literals will be encoded in a flexible, vector-like manner.

// test.walt
const memory: Memory<{ initial: 1 }>;

type String = {
   length: i32,
   data: i32[]
};

export function countTheRs(str: String): i32 {
   let i: i32 = 0;
   let count: i32 = 0;
   for (i = 0; i < str.length; i += 1) {
      if (str.data[i] == "r" || str.data[i] == "R") {
        count += 1;
      }
   }
   return count;
}

export function strcpy(to: String, from: String) {
   let i: i32;
   for (i = 0; i < from.length; i += 1) {
      to.data[i] = from.data[i];
   }
}

export function test(): i32 {
   const foobar: String = "foobar";
   countTheRs(foobar); // 1

   const foobaz: String = memory.dataSize;
   strcpy(foobaz, foobar);
   foobaz.data[foobaz.length - 1] = 'z';
   return countTheRs(foobaz); // 0
}

Pretty sure this is possible to write now; the only problem is with dereferencing the data array in the struct. Under the current implementation of Walt, foobar.data[0] would give the zeroth value at the memory address stored at byteAddressOf<foobar> + byteOffsetOfKey<data> (i.e. it would treat the character data, 'f', as an address), so it would result in garbage data. Instead, we want the zeroth value at the memory offset where .data starts in the struct.

To make the above work, array types of objects would need to view into the memory offsets of the object itself, not absolute addresses. Which is fine.

const str: String = 1024; // pre populated string at address 1024;
str.length; // value at address 1024 + 0
str.data[1]; // value at address 1024 + 4 (number of bytes for length property) + (1 << 2)
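That offset arithmetic, spelled out in JS (assuming, as above, length at byte offset 0 and data starting at byte offset 4):

```javascript
// Address math for `const str: String = 1024`, with array offsets
// relative to the struct base as proposed.
const base = 1024;
const addrOfLength = base + 0;
const addrOfDataAt = i => base + 4 + (i << 2); // 4 bytes per i32 element
```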

What about actual pointers? If you wanted to store an actual pointer inside the object, you could use a plain i32. The very first implementation of Walt with working arrays allowed for dereferencing any i32 type, not just i32[], so it might be possible; pretty sure I patched that out though. But you could always re-assign to an i32[]: const arrayFromPointer: i32[] = object.i32Ptr;

What about non-32-bit encodings? Should be trivial, once template literals are added for different encodings and smaller-than-32-bit array types. I'm okay with the idea by the way, even though it's a bit too clever on the surface. With a u8[] type, for example, you should be able to define the following and read from a previously encoded byte-char string: type String_u8 = { length: i32, data: u8[] };.
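On the JS side, reading such a String_u8 out of linear memory would be a one-liner with TextDecoder; a sketch, assuming the { length, data } layout from the type above:

```javascript
// Read { length: i32, data: u8[] } at a byte address from a wasm buffer.
function readU8String(buffer, byteAddr) {
  // wasm linear memory is little-endian
  const length = new DataView(buffer).getUint32(byteAddr, true);
  const bytes = new Uint8Array(buffer, byteAddr + 4, length);
  return new TextDecoder().decode(bytes); // UTF-8 → JS string
}
```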

What is memory.dataSize? This is the size of the data section (the total byte length of all the static strings and arrays in the application). Something which will be necessary to have around, but it will be easier to reason about as part of the memory object declared in the source. It may be necessary to store it as a reserved header in the memory address space, so that shared memories between modules can all be on the same page (heh).

There will be footguns.

One unfortunate but necessary side-effect of this is that there will be scenarios where the engineer using the language can really mess things up for themselves.

  • Mutating the data section.
const str: String = "foobar";
str.data[str.length - 1] = "z"; // :(
  • Wrong encoding for string types, wrong JavaScript interop. This could cause weird problems if not done correctly.

The compiler may try its best to save you from them.

The compiler could reasonably try to prevent you from overwriting data sections when it's obvious, but once you start sharing strings across functions and between JS <> WASM, you take matters into your own hands.

Power at your fingertips

Even though it will be possible to shoot yourself in the foot, it will also be possible to do whatever you want with strings and how you interop between other modules and JavaScript. Ultimately this goes back to the idea that Walt is C for the Web with JavaScript syntax.

All this means at the end of the day is that the engineer needs to do the thing they want; the language won't hold their hand. I think I'm okay with that.

List of changes that need to happen.

The following features and changes would need to happen for my example above to work. This list is semi-prioritized:

  • Changes/improvements to array type semantics within structs
    • This will enable rather easy interop between JavaScript and Walt as far as strings are concerned
  • Character Literals
  • Array types smaller than i32[] (u8, u16, etc)
  • Data section
  • Reserved Memory header (starting at address 0)
  • String literal encoding (i32 only)
  • Template literals
  • Nice to have
    • Compile time warnings for accidental memory writes
    • Runtime optional guards around special memory addresses.

It's good to be able to backtrack when you realise you're on the wrong track. I think I'm missing a few of the leaps being made in your thinking though. For example:

type String = { 
   length: i32,
   chars: i32[]
};

So I can guess this type definition represents a vector, but I'm not quite sure how it is supposed to work later on:

const foobar: String = "foobar";
// ...
const foobaz: String = memory.dataSize;

At no point is an address assigned to foobar.chars. Does the string literal do this automagically? I'm inclined to think that every time a string literal is used, it is automatically appended to the end of the Data section. What is memory.dataSize assigned to here? Does it point to the memory address after "foobar", and therefore assign that address to foobaz.chars? That is confusing to me, because this:

What is memory.dataSize? This is the size of the data section(total byte length of all the static strings and arrays in the application).

Implies to me that memory.dataSize is known at compile time, not something that varies during run-time or before/after string literal assignment.

Right, string literals need to be statically encoded into the data section of the binary which becomes a pre-defined memory region in the WebAssembly instance. In the binary, the literal will be represented as the address in this region.

That is why there are footguns. If you write into this memory you alter the values of string literals. It's possible to mitigate some of the risks but ultimately you can do whatever you want with your code.

I think Walts own specs would be a great testing ground for these concepts. Most tests are already very simple to convert, so I began a PR to do just this (#107).

Here is an example of exactly what I was talking about earlier.

export function run() {
  assert("array subscripts - (math should work)", testArraySubscript(), 4);
}

and then the JS bootstrap code for assert

env: {
  memory,
  assert(strPointer, value, expected) {
    let text = "";
    const length = view.getUint8(strPointer, true);
    for (let i = 0; i < length; i++) {
      text += String.fromCharCode(view.getUint8(strPointer + 1 + i, true));
    }
    t.is(value, expected, text);
  },
},

Pretty neat. And it works too (I forced a failure to see the output string)

(screenshot: failing assertion output, 2018-03-17)

I took a lot of liberties with the encoding which will need to be patched before it's shippable, but it's looking promising 👍

Right, string literals need to be statically encoded into the data section of the binary which becomes a pre-defined memory region in the WebAssembly instance. In the binary, the literal will be represented as the address in this region.

Ok, I follow now. Am I correct in guessing that array literals are the same, and that string literals are, in a sense, just syntactic sugar over the same array literals?

What I don't quite see is how this is then converted to:

// I now have a type that is a struct with a
// `length` and `chars` field
type String = { 
   length: i32,
   chars: i32[]
};

type SomeOtherString = {
  length: i32
  chars: i32[]
  someOtherThing: i32
}

// structs respect the order in which I
// declare the fields, right?
type IsThisAStringIDunno = {
  chars: i32[]
  length: i32
}

// Is automatic assignment of length to foobar.length,
// and memory address to foobar.chars part of the way
// that string literals work?
const foobar: String = "foobar";

// If so, does it work with these types?
const foostring: SomeOtherString = "Is this assigned properly?";
const fooballicious: IsThisAStringIDunno = "How about this?";

(I don't know if I'm asking dumb questions here or not)

Other questions:

  • does it make sense to have chars for Strings, and data (or whatever) for Array literals? Why not just have data for both, since really, Strings are just arrays?
  • what about the aforementioned "size" for arrays? Or more importantly: the current need to manually adjust memory locations if you have multiple arrays and change the size of one or the other.

// structs respect the order in which I
// declare the fields, right?

Correct. Fields are byte offsets from the base address, matching the order of definition in a struct type. So for your String type above, length is offset 0, chars is offset 4, and so on if you add more fields.
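To make the offsets concrete, here's a hypothetical JS-side sketch of reading such a String struct out of linear memory, assuming the one-code-point-per-i32 layout discussed above with the character data stored inline after the length (`readString` is an invented helper, not part of Walt):

```javascript
// Hypothetical sketch: read a `String { length: i32, chars: i32[] }`
// struct from linear memory using the byte-offset rule above
// (length at offset 0, chars starting at offset 4).
// WebAssembly memory is little-endian, hence the `true` flag.
const readString = (view, base) => {
  const length = view.getInt32(base, true); // field offset 0
  const chars = [];
  for (let i = 0; i < length; i++) {
    chars.push(view.getInt32(base + 4 + i * 4, true)); // field offset 4 onward
  }
  return String.fromCodePoint(...chars);
};
```

In a real setup `view` would be a `DataView` over `instance.exports.memory.buffer` and `base` a pointer returned from WASM.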

// Is automatic assignment of length to foobar.length,
// and memory address to foobar.chars part of the way
// that string literals work?

Pretty much. Because the memory is already assigned statically by the time the code executes, the field offsets of the String type automatically align with the data inside memory.

(I don't know if I'm asking dumb questions here or not)

I don't think so!

does it make sense to have chars for Strings, and data (or whatever) for Array literals? Why not just have data for both, since really, Strings are just arrays?

Yeah, it's not a built-in type so you could really define it how you like. There isn't any special meaning in the type, at least for now. If it were built-in, it would probably be a Vector<type> with data for all use-cases instead. But any type aligning with the memory would work, even this

type String = { length: i32, b: i32, a: i32, r: i32 };
// ...
const str: String = "bar";
str.b; // number value of 'b'

what about the aforementioned "size" for arrays? Or more importantly: the current need to manually adjust memory locations if you have multiple arrays and change the size of one or the other.

Not sure about this one. The memory management is manual, though that can be mitigated with even a simple malloc implementation.
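As a sketch of what "even a simple malloc" could look like, here's a hypothetical bump allocator; `makeAllocator` and the heap base value are invented for illustration, and a real module would derive the heap base from the static data section size (e.g. memory.dataSize):

```javascript
// Minimal bump-allocator sketch (hypothetical): hands out consecutive
// regions of linear memory, starting past the static data section.
// There is no free(); this only illustrates avoiding manual address math.
function makeAllocator(heapBase) {
  let top = heapBase;
  return function malloc(bytes) {
    const addr = top;
    top += (bytes + 3) & ~3; // round up to keep 4-byte alignment for i32 access
    return addr;
  };
}
```

The same few lines would translate almost directly to Walt itself, which is presumably the intent behind "implementing your own heap".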

How about immutable data types? A large part of the JS community is leaning towards immutability, and it could solve some of the issues discussed here: by banning property mutations and using methods like map, filter and slice, string and array memory could be remapped on each operation while keeping the footprint of each variable consistent.

Well, part of the point of low-level access is performance, and mutability is quite crucial for that in many situations.

Supporting optional immutability in a better way than JavaScript-style const and let currently allow sounds great, but half-baked immutability is also a bit of a false guarantee: unless everything is immutable, you usually end up in situations where one can still access and mutate all of these things through other means.

Also, I don't know if allowing a way to ban property mutations conflicts with ballercat's stance of:

Rewriting JavaScript or creating a new dynamic language within WebAssembly is a rabbit hole which has to be avoided.

Which I also agree with (I'm reminded of the Go developers fighting feature creep at every turn).

Perhaps it's worthwhile to make a list of which extra features really would help without opening a "turning into a new language" rabbit hole?

Regarding map, filter and slice: they sound like they should be implemented as some kind of utility library function, written in Walt itself (similar to ballercat's mention of implementing your own heap). Their implementations could even make nice examples for the explorer :)

If implemented correctly immutability shouldn't hurt performance: it's just an abstraction.

https://www.quora.com/Is-object-immutability-in-functional-programming-inherently-performance-intensive

Regarding map, filter and slice: they sound like they should be implemented as some kind of utility library function, written in Walt itself (similar to ballercat's mention of implementing your own heap). Their implementations could even make nice examples for the explorer :)

I disagree; I think it should be the standard way of doing things, and be optimised at the compiler level, so people will use it instead of mutation, whose memory footprint is harder to track at compile time.

I'm open to some high-level discussion on the topic, but I'm not sure I understand which of the problems discussed here immutable data types would solve, or how that applies to string encoding.

Not that they are inherently a bad idea or in some way not useful. I guess what I'm getting at is that safe immutability is unenforceable when low-level mechanisms are already exposed.

Consider these (overlapping) reasons why this is true:

  • There is no runtime to perform the type checks
  • There is no Garbage Collection to hide the memory gymnastics
  • There are no memory guards or virtual memory space to play in. 100% of the allotted memory bits are accessible and thus editable.

To enforce true immutability you need all three; or rather, you need the first two, plus, for the third point, a completely opaque memory space for the application. That sounds a lot like JavaScript, and really isn't the point of the project.

Now, that being said, there could be some nice features like deep constant types, which would echo some of the semantics of const in C/C++, for example. Is this what you're getting at? We could discuss these in another issue, as long as we understand that these are simply sugar and nothing more (the memory can always be changed).

Also, immutability in JavaScript is more of a pattern; Objects in JS are inherently mutable. We choose to use immutable patterns because they usually lead to a more stable system. Nothing is preventing an Engineer down the road from applying the same patterns in Walt as well. We could go into more detail, but it's probably best to discuss in another issue.

We do, but many would have preferred that it weren't available by default.

  1. Types are defined statically and by now are pretty easy to track, so no runtime type checks are actually necessary.
  2. I don't see why, in a fully typed language with simple operators, immutable data structures would require GC; the memory for allocations is known in advance. Rust does it very well already.
  3. I think that if you edit memory bits, it's your responsibility to make sure your program doesn't fail. The compiler can make assumptions based on the code provided to it.

Actually, prior art exists: https://news.ycombinator.com/item?id=12426744

bc: how does that apply to string encoding.

I assumed it has to do with the question of how to reduce the footgun problem when it comes to mutating the data section.

Anyway, I'm probably just repeating ballercat's points with different phrasing in response to the next bits, and my entire comment is completely off-topic for the Strings discussion, so if we want to continue this discussion we should open a new issue for it.

iddan: If implemented correctly immutability shouldn't hurt performance: it's just an abstraction. https://www.quora.com/Is-object-immutability-in-functional-programming-inherently-performance-intensive

Could you elaborate what you mean with "just an abstraction"? Because I think it really downplays the rabbit hole that this topic is. I doubt implementation-sensitive persistent data structures that come with their own trade-offs and limitations qualify as "just abstractions".

And don't get me wrong: persistent data structures are amazing, but defaulting to them sounds like a very different type of language than what Walt appears to be going for, which is:

C for the Web with JavaScript syntax.

C as in "barely higher level than assembly" (or WASM in this case). Note: this is not even C++ with its RAII and destructors. It's more basic than that. Sure, there are improvements over C to be made, like perhaps including option types and hygienic macros, but this kind of immutability sounds like it would be way outside of the scope of the project.

Types are defined statically and by now are pretty easy to track so no runtime type checks are actually necessary

I sincerely don't know what kind of type system Walt is built on, and whether it could make this easy. In general, doesn't a type system with that kind of capability require something like Rust's borrow checker? Or at least its affine type system. The only other languages I know of that have both low-level performance and immutability guarantees are Pony and Kitten (which happens to be the work by evincarium in the post that you referred to).

I've followed the development of those: Pony has a type system with deny capabilities. Kitten is a typed concatenative language with a permission-system. None of these options seem trivial to get right.

Rust does it very well already.

Yes, but Rust is a project backed by a large organisation like Mozilla, with tons of contributors. Walt can't really compete with that. Besides, if you want what Rust offers, then why not just use Rust?

To tie it to my earlier point: these are amazing languages, but they are also trying to do very different things than Walt.

Walt (edit: as I understand ballercat) is not trying to be a safe modern systems language for large software projects like Rust or Pony, nor an experimental strongly typed concatenative language like Kitten (being able to write WASM in Kitten sounds great though).

Maybe something for a compiles-to-Walt language?

Anyway, I think using template functions is a very Pythonic syntax. I suggest using types for different string types.

I think using template functions is a very Pythonic syntax

I think we've reached the part of the discussion where bike-shedding starts to take place, so we should tread carefully 😅. Could you explain what you mean by "Pythonic" and what the issues are with it?

I've opened a different issue on Walt's type system - #109 - because I realised I really don't know the extent of its capabilities. I'm all for powerful types and using them, but we need to be sure that we are on the same page there or our discussions won't be very fruitful - I can't judge if using types for strings has merit or not right now.

That way or another you would need to understand string’s types. I meant Pythonic because in Python there’s ‘b"foo"’

I meant Pythonic because in Python there’s ‘b"foo"’

Aaah, like that! And in Python this is necessary because it is dynamically typed, but that probably does not have to apply here (remember that I forgot Walt does have static typing before).

That way or another you would need to understand string’s types.

So for the sake of explicitness, I'm going to write out the type-based approach.

Say Walt ends up with a simple type system like C (it surely can do better than that while still staying low-level, but let's take this as a "simplest case"-baseline until the types discussion takes off and goes somewhere). Then, in oversimplified terms, a type just determines the shape of the memory layout, and/or how values are coerced when being assigned.

In C that works because C is just number values and pointers, arrays "are pointers" (not really but keeping things simple here), and strings are just null-terminated arrays.

How would having multiple string types work with string literals?

let str = 'type inferred, which will probably be one char per i32, so essentially "UTF32"'
let s8: Utf8 = 'Converted to UTF8, packed in 32-bit words';
let s16: Utf16 = 'Converted to UTF16, still packed in 32-bit words';
let a: Char = 'a';

Walt would have to determine the type on the LHS, then, when converting the string literal to static data, use a type switch to pick the right encoding method. I guess that's simple enough.

The only thing is that all of these types' methods would need to be built in.

If we could tempt @ballercat into taking a look at Julia's type system with customisable promotion rules, and see if that could be done in a static, compile-time fashion, we could keep "core Walt" lightweight and add these conversions via libraries 😉

No, but seriously, it may well be that it would effectively be the same amount of work: either you write the code as a built-in, or you write it as a library. The difference is that the core language can remain minimal.

There have been some developments on the string encoding front.

I mentioned that I think self-hosting some code within Walt as a testing ground for string handling is a good idea. So I've been moving along with writing specs for the compiler, with Walt. It's been a bumpy ride but I'm pretty happy with this direction.

String Encoding

After POC-ing several approaches I settled on an unsigned LEB128 encoding for all strings. This is by far the most flexible encoding, since it allows for simple ASCII as well as full UTF code points, but for most use-cases only 8 bits will be used per character. A simple representation of the encoding looks like this: `varuint32 length, varuint32 chars[0 ... length]`. It's the internal binary encoding used inside WebAssembly, so the idea is taken directly from it. One small but important point is that static strings are not padded with 0x80 bytes, so the individual character byte length may vary within a single string.
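For anyone unfamiliar with LEB128, here is a rough JS sketch of producing `varuint32 length, varuint32 chars[...]` data; this is a hedged illustration of the encoding described above, not the compiler's actual encoder, and the function names are invented:

```javascript
// Unsigned LEB128 (varuint32): 7 payload bits per byte, with the
// high bit (0x80) set on every byte except the last.
function varuint32(value) {
  const bytes = [];
  do {
    let byte = value & 0x7f;
    value >>>= 7;
    if (value !== 0) byte |= 0x80; // continuation ("link") bit
    bytes.push(byte);
  } while (value !== 0);
  return bytes;
}

// A string then encodes as varuint32(length) followed by one
// varuint32 per code point (spreading a string iterates code points,
// so surrogate pairs like emoji come through as single values).
function encodeString(str) {
  const points = [...str].map(c => c.codePointAt(0));
  return [...varuint32(points.length), ...points.flatMap(varuint32)];
}
```

Note that plain ASCII code points fit in one byte each, which is what makes the ASCII fast path mentioned below possible.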

Knock on effects

The exercise of trying to make strings work had some practical implications.

String type

In practice, the string type does not work out well due to the encoding I chose for it. A variable length encoding which I adopted does not map to static structs at all since, well, the encoding is of variable length. Passing around pointers works out just as well.

I could see a use for a string built-in type to at least provide some type safety at compile time though. Similar to the issue of boolean types (#108)

Decoding

If you care about anything outside of 7-bit ASCII you are going to have to do some bit math. LEB128 reserves the high-order bit (0x80) as the link bit. That being said, decoding unsigned LEB128 is simple; here is a JavaScript version of the generator used for the assertion text. A DataView is used which points at the WASM memory buffer. The start address comes directly from WASM code calling into the assert method.

export function* stringDecoder(view, start) {
  let length = 0;
  let index = 0;
  let shift = 0;
  let addr = start;

  while (true) {
    const byte = view.getUint8(addr, true);
    length |= (byte & 0x7f) << shift;
    addr += 1;
    if ((byte & 0x80) === 0) {
      break;
    }
    shift += 7;
  }

  let result = 0;
  while (index < length) {
    result = 0;
    shift = 0;
    while (true) {
      const byte = view.getUint8(addr, true);
      result |= (byte & 0x7f) << shift;
      addr += 1;
      if ((byte & 0x80) === 0) {
        break;
      }
      shift += 7;
    }
    index += 1;
    yield result;
  }
}

But again, if you only care about ASCII you may simply loop over the bytes until you've read the entire length. I could maybe be convinced to make the length header a full 32 bits instead of variable length to really make this trivial.
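For example, under the assumption that both the length header and every code point fit in 7 bits (pure ASCII, fewer than 128 characters), the LEB128 bytes are just the raw values and the whole decode collapses to a plain loop (a sketch; `decodeAscii` is an invented name):

```javascript
// ASCII-only shortcut: assumes every varuint32 in the string is a
// single byte, i.e. length < 128 and all code points < 128.
function decodeAscii(view, start) {
  const length = view.getUint8(start); // one-byte length header
  let text = "";
  for (let i = 0; i < length; i++) {
    text += String.fromCharCode(view.getUint8(start + 1 + i));
  }
  return text;
}
```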

If you're interested here is the same thing but in Walt! No generator functions, but more or less the same approach. This spec implements some string iterators and a naive indexOf inside Walt 🎊

export function run() {
  assert("string iterator, count ASCII", countCharacter("aabbccddaa aaa", 97), 7);
  assert("string iterator, count emojis", countCharacter("aabb😂😂😂 aaa😂", 128514), 4);
  assert("string iterator, indexOf ascii", indexOf("quick brown fox", "brown"), 6);
  assert(
    "string iterator, indexOf utf8",
    indexOf('Liberté, égalité, fraternité for all utf encodings!', 'égalité'),
    9
  );
}

Exposed Native Opcodes

I gave u8[] arrays a shot; they are a pain to implement. u8 isn't a valid wasm size/type, and supporting it takes a bunch of compiler magic at this point. Instead, I'm going to expose native load_u8 opcodes, which are a cinch to use. Here is an example of the Walt string spec I've been implementing.

While it may look like a function call, it's actually a single wasm opcode and it is quite easy to work with. I think I'll expose only the native load/store instructions with <type>.<method> syntax.

This means I won't be implementing the different array types u8, u16 I mentioned before.

Self Hosting

The process of doing the work to self-host only a couple of tests has been valuable. While I only wanted to test strings, in practice it is proving useful in pointing out Developer Experience issues, and boy is it great for finding bugs. I'll more than likely pump the brakes on most new sugar and focus on making the syntax more practical as I begin to convert actual pieces of the compiler to Walt. Seems like a pretty good way to ensure that the tool does what it says when you dogfood it 😂

I gave u8[] arrays a shot, they are a pain to implement. u8 isn't a valid size/type and supporting it takes a bunch of compiler magic at this point.

You know, I was wondering: could there be a use-case for having something like hygienic macros or some other kind of pre-processor thing that is more modern and safer than what C has to offer, that would both make extending the compiler easy for you, let others build their own extensions, and keep the core code modular?

You know, I was wondering: could there be a use-case for having something like hygienic macros or some other kind of pre-processor thing that is more modern and safer than what C has to offer, that would both make extending the compiler easy for you, let others build their own extensions, and keep the core code modular?

It would be helpful; I think the project would need to mature and stabilize more before adding this feature, though.

I'm going to be closing the issue, as static strings were implemented and tested as part of (#107). This is about as much as I'd like to implement for strings at this point from the compiler/language side of things.

Thank you very much for the excellent discussion and brainstorming.

Glad the discussion was of use!

My timing is a bit unfortunate now, but someone recently pointed out the TextEncoder and TextDecoder API to me, which converts JS strings to UTF8 Uint8Arrays and back. It seems like the ambition is for these APIs to be the go-to methods for converting strings to and from WASM code.

Note: Firefox, Chrome and Opera used to have support for encoding types other than utf-8 (such as utf-16, iso-8859-2, koi8, cp1261, and gbk). As of Firefox 48 (ticket), Chrome 54 (ticket) and Opera 41, no other encoding types are available other than utf-8, in order to match the spec. In all cases, passing in an encoding type to the constructor will be ignored and a utf-8 TextEncoder will be created (the TextDecoder still allows for other decoding types).

https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder

https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder

It is not fully supported, but part of the living standard. Probably too much of a moving target anyway, but probably good to have here for future reference, in case the strings discussion needs to be revisited at some point.
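For future reference, a minimal round trip with these APIs looks like the following; in a WASM setting, the Uint8Array would be written into or read from instance.exports.memory.buffer instead of a standalone buffer:

```javascript
// TextEncoder always produces UTF-8 bytes (per the living standard);
// TextDecoder turns UTF-8 bytes back into a JS (UTF-16) string.
const bytes = new TextEncoder().encode("héllo"); // Uint8Array of UTF-8 bytes
const text = new TextDecoder("utf-8").decode(bytes);
```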

I'm a little confused as to what the final implementation ended up being.. The wiki doesn't seem to be updated with how to deal with strings..

could someone post a "Hello world" example or something..?

The compiler exports stringDecoder and stringEncoder utilities. They're used extensively in the self-hosted compiler specs. stringDecoder returns a generator.

I did not actually run this code below, but it should work just fine.

import compile, { stringDecoder } from 'walt-compiler';

export const getText = view => ptr => {
  let text = "";
  const decoder = stringDecoder(view, ptr);
  let iterator = decoder.next();
  while (!iterator.done) {
    text += String.fromCodePoint(iterator.value);
    iterator = decoder.next();
  }

  return text;
};

const source = `
   export const memory : Memory = { initial: 1 };
   export function hello() : i32 {
      return "Hello World!";
   }
`;

WebAssembly.instantiate(compile(source)).then(({ instance }) => {
    const view = new DataView(instance.exports.memory.buffer);
    const decodeText = getText(view);

    console.log(decodeText(instance.exports.hello())); // "Hello World!"
});

Hmm.. escaping doesn't seem to work..

"Hello\nworld!" just seems to literally result in Hello\nworld! instead of

Hello
world!

Escape sequences are not treated in any special way by the compiler (the exception is character literals).

It's not that we couldn't, but I did not want to make a decision about that yet. The reason is that the decoding logic can handle this on the JS side (if '\' is followed by a character then it's an escape sequence... treat it differently).

It's reasonable enough that I might change this in the future, but for now extending the JS glue logic will fix this.

Fair enough.. in the meantime, could you treat backticks `` the same way as quotes? not necessarily a complete string template implementation, but just to make it possible to make multiline strings..

Sure thing. Would you mind putting up a small issue about this? I don't know if I can get to it today and I don't want to forget about it.

Here's a more canonical way to write getText

import { stringDecoder } from 'walt-compiler'

const getText = (memory) => (pointer) => {
  const codePoints = []
  for (const codePoint of stringDecoder(new DataView(memory.buffer), pointer)) {
    codePoints.push(codePoint)
  }

  return String.fromCodePoint(...codePoints)
}

Note I changed the API, it takes a WebAssembly.Memory, not a DataView.