clasp-developers / clasp

clasp Common Lisp environment

Home Page: https://clasp-developers.github.io/


Unicode Support and UTF Encoding

Shinmera opened this issue · comments

Unicode is the de-facto standard encoding for text nowadays. As such, Clasp must support it in order to be able to run a lot of useful software. As an initial suggestion, using UTF-32 internally for strings would be a good choice, since every Unicode code point fits into a single 32-bit code unit, which allows constant-time access on strings. The size should not be a problem on modern systems. For external formats, UTF-8 and UTF-16 support should also be added.
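As a small illustration of the fixed-width vs. variable-width trade-off (the long Unicode character names below follow SBCL-style reader syntax, which is an assumption about Clasp's reader):

```lisp
;; Every Unicode code point is at most #x10FFFF, so it fits in one 32-bit
;; code unit; in UTF-8 the same code points need 1-4 octets each.
(char-code #\a)                         ; => 97     (1 octet in UTF-8)
(char-code #\GREEK_SMALL_LETTER_LAMDA)  ; => 955    (2 octets in UTF-8)
(char-code #\HIRAGANA_LETTER_A)         ; => 12354  (3 octets in UTF-8)
;; With a UTF-32 representation, (char string i) is a single array access.
```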

Since Clasp's main purpose is interaction with C++ libraries, a variety of support functions and mechanisms might have to be added to ease the conversion and sharing of string data between Clasp and external or bound libraries. This might necessitate supporting different string representation formats internally, to allow relatively efficient handling of strings without having to rely on conversion every time the Clasp/library boundary is crossed.

Currently, character strings (as opposed to base strings) are UTF-32. It looks like we have UTF-8 and UCS-4 (which is the same as UTF-32, I think) as external formats. We also have UCS-2, which is not actually equivalent to UTF-16, I think. We don't have any support for Unicode algorithms like normalization forms, etc. We also handle some character functions incorrectly, e.g. (char-downcase #\CYRILLIC_CAPITAL_LETTER_IE_WITH_GRAVE) => #\CYRILLIC_CAPITAL_LETTER_IE_WITH_GRAVE, i.e. the character comes back unchanged.
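For reference, a sketch of what a conforming full-Unicode case mapping would produce here (expected behavior, not current Clasp output):

```lisp
;; U+0400 CYRILLIC CAPITAL LETTER IE WITH GRAVE should map to
;; U+0450 CYRILLIC SMALL LETTER IE WITH GRAVE under Unicode case mapping.
(char-downcase #\CYRILLIC_CAPITAL_LETTER_IE_WITH_GRAVE)
;; expected => #\CYRILLIC_SMALL_LETTER_IE_WITH_GRAVE
;; current  => #\CYRILLIC_CAPITAL_LETTER_IE_WITH_GRAVE (input returned unchanged)
```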

A potentially crazy idea: integrate babel into the build/bootstrap process and use it?

https://github.com/cl-babel/babel

It only covers external-format reading/writing/conversion, though.
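For context, babel's public conversion entry points look roughly like this (a usage sketch, not proposed Clasp integration code):

```lisp
;; String -> octets:
(babel:string-to-octets "Ѐ λ" :encoding :utf-8)
;; => #(208 128 32 206 187)

;; Octets -> string (round trip):
(babel:octets-to-string (babel:string-to-octets "Ѐ λ" :encoding :utf-8)
                        :encoding :utf-8)
;; => "Ѐ λ"
```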

As an initial suggestion, using UTF-32 internally for strings would be a good choice, since every Unicode code point fits into a single 32-bit code unit, which allows constant-time access on strings.

Constant-time access to code points isn't really that useful. In most situations where you would think it's what you want, it actually isn't, because code points don't correspond exactly to meaningful, user-facing characters.
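For example, the same user-visible character can be one code point or several (again assuming SBCL-style long character names):

```lisp
;; "é" as a single precomposed code point (U+00E9):
(length (string #\LATIN_SMALL_LETTER_E_WITH_ACUTE))            ; => 1
;; "é" as a base letter plus a combining mark (U+0065 U+0301):
(length (coerce (list #\e #\COMBINING_ACUTE_ACCENT) 'string))  ; => 2
;; Both display as one user-facing character (grapheme cluster).
```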

It is useful in the context of Common Lisp, because most sequence functions use aref et al. to access elements of strings, which are indexed per character, and thus require constant-time random access to be efficient at all.
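A minimal sketch of why: typical sequence-style code walks a string index by index, so per-index access has to be O(1) for the whole loop to stay O(n). (count-char here is a hypothetical helper, not Clasp code.)

```lisp
(defun count-char (ch string)
  ;; Touches every index with CHAR. If CHAR had to scan a variable-width
  ;; encoding from the start of the string, this loop would be O(n^2).
  (loop for i below (length string)
        count (char= ch (char string i))))

(count-char #\l "hello world") ; => 3
```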

I see. So they can't use something more like an iterator? The language mandates that they go index by index, and that implementation detail can't be hidden?

Yes. Strings are vectors, and violating the assumption that they can be accessed in constant time per index would lead to lots of problems. You'd have to devise a separate string type entirely, with an API that does not follow the vector one. I think @Bike did make a library for UTF-8 strings using the extensible sequences API as a proof of concept once, for instance.
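For a rough idea of what such a separate string type looks like under the extensible sequences API, here is a minimal sketch assuming an SBCL-style SB-SEQUENCE package (the class and method below are illustrative, not taken from that library, and Clasp may name the package differently):

```lisp
;; A custom string type backed by UTF-8 octets, plugged into the extensible
;; sequences protocol by subclassing SEQUENCE and specializing its generics.
(defclass utf8-string (standard-object sequence)
  ((octets :initarg :octets :reader utf8-string-octets)))

(defmethod sb-sequence:length ((s utf8-string))
  ;; Count code points: every octet that is not a continuation octet
  ;; (top bits #b10) starts a new code point.
  (count-if (lambda (octet) (/= (ldb (byte 2 6) octet) #b10))
            (utf8-string-octets s)))
```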

I see. So they can't use something more like an iterator? The language mandates that they go index by index, and that implementation detail can't be hidden?

In the "sequences" extension to the language, you can implement an iteration protocol which is then used by standard sequence functions (map, reduce, etc.). But strings are vectors, and the iterator implementation for vectors does use indices.

As Shinmera said, I did write a library to use UTF-8 encoded strings, which defines iteration without relying on indices so much. For example here is the "next" function: https://github.com/Bike/utf8string/blob/master/utf8string.lisp#L456-L464
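The linked code is the library's own; as a rough illustration of the general idea (not the actual utf8string implementation), advancing to the start of the next code point in a UTF-8 octet vector looks something like this:

```lisp
(defun next-code-point-start (octets index)
  "Return the index of the octet starting the next code point after the one
beginning at INDEX, or the vector length at the end of the string."
  ;; Continuation octets have their top two bits equal to #b10; skip them.
  (or (position-if (lambda (octet) (/= (ldb (byte 2 6) octet) #b10))
                   octets :start (1+ index))
      (length octets)))

;; (next-code-point-start #(#xD0 #x80 #x61) 0) => 2   ; #xD0 #x80 encode U+0400
```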

Could I ask why this issue has been revived? I mean, we never closed it, because we lack some aspects of Unicode support (like anything analogous to sb-unicode), but we do have UTF-32 strings and such now and have for a while.