titzer / virgil

A fast and lightweight native programming language

UTF-8 string literals

srackham opened this issue

Does Virgil support UTF-8 string literals?

The documentation suggests it does:

- Aeneas parses text as bytes, only allows UTF-8 inside string constants

Here I've inserted the copyright character in a string literal:

$ cat hello.v3
def main() {
        System.puts("Hello World ©\n");
}

$ virgil run tmp/hello.v3
[tmp/hello.v3 @ 2:21] ParseError: invalid string literal
        System.puts("Hello World ©\n");
                    ^

Hex byte values work though:

$ cat hello.v3
def main() {
        System.puts("Hello World \xC2\xA9\n");
}

$ virgil run hello.v3
Hello World ©
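
For reference, the two escapes above are simply the UTF-8 encoding of the copyright character (U+00A9). A small Python check, assuming nothing about Virgil itself (`utf8_hex` is a hypothetical helper name used only for illustration):

```python
def utf8_hex(text: str) -> list[str]:
    """Return the UTF-8 bytes of text as two-digit uppercase hex strings."""
    return ["%02X" % byte for byte in text.encode("utf-8")]


# The copyright sign is one code point but two bytes in UTF-8.
print(utf8_hex("©"))  # ['C2', 'A9']
```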

You're right, that's a bug. It should handle UTF-8 in string literals, but it does not yet.

I was planning to improve Unicode support by changing the string type (currently an alias for Array<byte>), but this could probably be supported just by letting the UTF-8 bytes through in string literals.

Thanks.

As a workaround, you can convert a UTF-8 string to hex byte escapes with, for example:

$ echo -n "Hello World ©" | od -A n -t x1 | tr -d '\n' | sed 's/ /\\x/g'
\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x20\xc2\xa9
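
The pipeline above escapes every byte, including plain ASCII, which makes the literal hard to read. A rough alternative is sketched below in Python: it leaves ASCII bytes alone and escapes only the non-ASCII ones. The helper name `escape_non_ascii` is made up for this example. The sketch assumes that a Virgil `\x` escape consumes exactly two hex digits (consistent with the `\xC2\xA9\n` literal shown earlier, but not verified beyond that), and it does not escape quotes or backslashes already present in the input.

```python
def escape_non_ascii(text: str) -> str:
    """Rewrite text so that each non-ASCII UTF-8 byte becomes a \\xNN escape.

    ASCII bytes pass through unchanged. Quotes and backslashes in the
    input are NOT escaped, so handle those separately if needed.
    """
    parts = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            parts.append(chr(byte))        # plain ASCII, keep readable
        else:
            parts.append("\\x%02X" % byte)  # non-ASCII byte as hex escape
    return "".join(parts)


print(escape_non_ascii("Hello World ©"))  # Hello World \xC2\xA9
```

The printed text can then be pasted between the quotes of a string literal, as in the working hello.v3 example.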