UTF encoding/decoding instructions

Question


CAFxX opened this issue 2 years ago · comments

Carlo Alberto Ferraris commented 2 years ago

Sorry for the drive-by question. I searched the mailing list and the existing issues but could not find any previous discussion of this. If this has already been answered, pointers to the relevant resource(s) would be greatly appreciated.

Has adding instructions for encoding/decoding UTF-8 (and ideally also UTF-16 and UTF-32) ever been considered or discussed, likely as a new dedicated subset of this extension? Most text processed today is in one of those encodings¹, and there is little on the horizon to suggest this will change. Decoding/encoding UTF is not especially complicated without dedicated instructions (and the existing bitmanip instructions can already help), but given the ubiquity of these encodings and the relative logical simplicity of the coding process (at their heart, UTF-8 and UTF-16 are simple-to-decode variable-length encodings) there may be efficiency benefits² to be obtained with dedicated support.

Just for the sake of clarity: in its simplest form (covering only UTF-8 → codepoint decoding) this would require a single instruction that takes a 4-byte input (the maximum length of a UTF-8 encoded codepoint, likely obtained via an unaligned read from memory) and returns three things: the decoded Unicode codepoint (at most 21 bits, so it fits in 3 bytes), how many bytes of the input were consumed (between 1 and 4, inclusive), and whether the decoding encountered an error. The need to return multiple values is probably the biggest roadblock to inclusion in the ISA, although I suspect there may be workarounds.
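To make the semantics described above concrete, here is a rough software model in C of what such a decode instruction might compute. Everything specific in it is an assumption for illustration and not part of the proposal: the names (`utf8_decode4`, `utf8_cp`, `utf8_len`, `utf8_err`), the little-endian byte order of the input word, the error policy (returning U+FFFD), and above all the packed result layout, which is just one possible workaround for the multiple-return-value problem.

```c
#include <stdint.h>

/* Hypothetical packed result (an assumption, not from the proposal):
 *   bits  0..20  decoded codepoint (U+FFFD on error)
 *   bits 24..26  number of input bytes consumed (1..4)
 *   bit  31      error flag
 */
#define UTF8_ERR (UINT32_C(1) << 31)

static uint32_t utf8_pack(uint32_t cp, uint32_t len, int err) {
    return cp | (len << 24) | (err ? UTF8_ERR : 0);
}

/* Software model of the decode step. `in` holds 4 input bytes loaded
 * little-endian: the first byte of the sequence is in bits 0..7. */
uint32_t utf8_decode4(uint32_t in) {
    uint32_t b0 = in & 0xFF;
    uint32_t tail[3] = { (in >> 8) & 0xFF, (in >> 16) & 0xFF, (in >> 24) & 0xFF };
    uint32_t cp, len, min;

    if (b0 < 0x80) return utf8_pack(b0, 1, 0);                 /* ASCII */
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return utf8_pack(0xFFFD, 1, 1);   /* stray continuation or invalid lead byte */

    for (uint32_t i = 0; i + 1 < len; i++) {
        /* Each trailing byte must look like 10xxxxxx; on failure, report
         * only the bytes before the offending one as consumed. */
        if ((tail[i] & 0xC0) != 0x80) return utf8_pack(0xFFFD, i + 1, 1);
        cp = (cp << 6) | (tail[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return utf8_pack(0xFFFD, len, 1);
    return utf8_pack(cp, len, 0);
}

/* Accessors for the hypothetical packed layout. */
static inline uint32_t utf8_cp(uint32_t r)  { return r & 0x1FFFFF; }
static inline uint32_t utf8_len(uint32_t r) { return (r >> 24) & 0x7; }
static inline int      utf8_err(uint32_t r) { return (int)(r >> 31); }
```

With this layout, a decoding loop would load 4 bytes, call the decode step, branch on the error bit, and advance the pointer by `utf8_len(r)`. Other workarounds are conceivable (for example, splitting the operation into separate "length" and "codepoint" instructions); this sketch does not argue for one over another.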

Extensions to the simplest form could include, as hinted at above:

- the reverse operation (codepoint → UTF-8)
- support for backwards³ decoding
- support for {en|de}coding UTF-16/UTF-32 ↔ codepoint
- bi-endianness (for UTF-16 and UTF-32)
- optional relaxed compliance with the UTF specifications (to cover UTF variants used in the wild)
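For reference, a few of the extensions listed above can be sketched in plain C as follows. This is only a software illustration of the operations, not a proposed instruction definition; the function names (`utf8_encode`, `utf16_encode`, `utf8_seq_start`) and the convention of returning 0 for an invalid codepoint are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse operation: codepoint -> UTF-8. Returns the number of bytes
 * written to out (1..4), or 0 if cp is a surrogate or above U+10FFFF. */
size_t utf8_encode(uint32_t cp, uint8_t out[4]) {
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates are not encodable */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Codepoint -> UTF-16 code units (host byte order). Returns 1 or 2 units,
 * or 0 if cp is a surrogate or above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF) return 0;
    cp -= 0x10000;                                   /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));        /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));      /* low surrogate */
    return 2;
}

/* Backwards decoding helper: given the index `end` of the LAST byte of an
 * encoded codepoint, step back over at most 3 continuation bytes
 * (10xxxxxx) to find the index of its first byte. Does not validate. */
size_t utf8_seq_start(const uint8_t *buf, size_t end) {
    size_t i = end;
    size_t steps = 0;
    while (i > 0 && steps < 3 && (buf[i] & 0xC0) == 0x80) {
        i--;
        steps++;
    }
    return i;
}
```

Bi-endianness for UTF-16/UTF-32 would then amount to an optional byte swap of each code unit before decoding or after encoding. Once `utf8_seq_start` has located the first byte, a backwards iterator can reuse an ordinary forward decode from that position.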

Going further, one could even imagine an expansion (outside of this extension) to a packed SIMD version⁴ of the same operations, able to {de|en}code multiple codepoints at the same time.

¹ and this includes resources with a text representation even if not exclusively meant for direct human consumption, like JSON, CSV, HTML, and other source code
² while the English-speaking world may have historically been fine assuming that most text would be quickly parseable in the ASCII subset of UTF-8, so the need for efficient handling of non-ASCII codepoints was lesser, this has never been true in the rest of the world
³ i.e. the ability to decode a codepoint knowing where the last byte of the encoded representation is (instead of knowing where the first byte of the encoded representation is); this is useful when iterating backwards over text
⁴ or even a vector version, albeit this would possibly require a prohibitively high gate count for any reasonable VLEN

svobodnik · Answer 1 · Tue Mar 28 2023 07:14:58 GMT+0800 (China Standard Time)

CISC-V when?
