Convert 4 byte utf8 to two surrogate utf16 and back not implemented

Question

rovasiras opened this issue 2 years ago

Kovács Viktor and Tisza István developers commented 2 years ago

The UTF-8 to UTF-16 converter functions cannot convert 4-byte UTF-8 sequences to UTF-16 surrogate pairs, or back. This could be solved with a
4-byte UTF-8 -> UTF-32 -> two UTF-16 surrogates conversion. See the Unicode FAQ on "UTF-8, UTF-16, UTF-32".
The reverse conversion can be implemented with the same logic. The first byte of a 4-byte UTF-8 sequence is in the range 0xF0 to 0xF7 (only 0xF0 to 0xF4 occur in valid Unicode).
This is needed for languages and scripts whose code points lie above 0xFFFF, for example the Old Hungarian script.
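
A minimal sketch of the forward path described above (4-byte UTF-8 -> UTF-32 -> surrogate pair). The function names `decode_utf8_4` and `to_surrogates` are hypothetical and are not taken from hunspell; the sketch assumes the caller has already found a lead byte of the form 11110xxx.

```cpp
#include <cstddef>

// Hypothetical helper, not part of hunspell: decodes one 4-byte UTF-8
// sequence starting at s into a UTF-32 code point.
// Returns false if the sequence is truncated, malformed, overlong,
// or encodes a value above U+10FFFF.
bool decode_utf8_4(const unsigned char* s, std::size_t len, char32_t& cp) {
  if (len < 4) return false;
  if ((s[0] & 0xF8) != 0xF0) return false;    // lead byte must be 11110xxx
  for (int i = 1; i < 4; ++i)
    if ((s[i] & 0xC0) != 0x80) return false;  // continuation bytes are 10xxxxxx
  cp = (static_cast<char32_t>(s[0] & 0x07) << 18) |
       (static_cast<char32_t>(s[1] & 0x3F) << 12) |
       (static_cast<char32_t>(s[2] & 0x3F) << 6) |
       static_cast<char32_t>(s[3] & 0x3F);
  return cp >= 0x10000 && cp <= 0x10FFFF;
}

// Hypothetical helper: splits a code point above U+FFFF into a
// high (0xD800..0xDBFF) and low (0xDC00..0xDFFF) UTF-16 surrogate.
bool to_surrogates(char32_t cp, char16_t& hi, char16_t& lo) {
  if (cp < 0x10000 || cp > 0x10FFFF) return false;
  cp -= 0x10000;                                       // 20 bits remain
  hi = static_cast<char16_t>(0xD800 + (cp >> 10));     // upper 10 bits
  lo = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));   // lower 10 bits
  return true;
}
```

For example, the bytes F0 90 B2 80 decode to U+10C80, the first code point of the Old Hungarian block, which splits into the surrogates 0xD803 and 0xDC80.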

Kovács Viktor and Tisza István developers · Answer 1 · Sat Sep 17 2022 02:15:12 GMT+0800 (China Standard Time)

@caolanm what do you think?

Jan · Answer 2 · Tue Oct 04 2022 19:39:35 GMT+0800 (China Standard Time)

In theory, C++11 provides such conversions in the <codecvt> header, and WinAPI in the functions MultiByteToWideChar/WideCharToMultiByte. But <codecvt> is deprecated in C++17, and the WinAPI functions are available only on Windows and require two passes over the string.

Kovács Viktor and Tisza István developers · Answer 3 · Wed Oct 05 2022 03:04:26 GMT+0800 (China Standard Time)

> In theory, C++11 provides such conversions in the <codecvt> header, and WinAPI in the functions MultiByteToWideChar/WideCharToMultiByte. But <codecvt> is deprecated in C++17, and the WinAPI functions are available only on Windows and require two passes over the string.

@cuellius I am aware of that. I was referring to the u16_u8 and u8_u16 functions in hunspell/csutil.cxx.

Kovács Viktor and Tisza István developers · Answer 4 · Wed Oct 05 2022 20:20:52 GMT+0800 (China Standard Time)

@caolanm The comments in the u8_u16 and u16_u8 functions in hunspell/csutil.cxx say that conversion of 4-byte UTF-8 sequences is not implemented yet.

I think the algorithm in #851 (comment) could be useful.
See the Unicode FAQ on "UTF-8, UTF-16, UTF-32".
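
For the reverse direction (surrogate pair -> 4-byte UTF-8), a sketch under the same assumptions might look like the following. The name `append_surrogate_pair_as_utf8` is hypothetical; this is not the code of u16_u8, only an illustration of the arithmetic involved.

```cpp
#include <string>

// Hypothetical helper, not part of hunspell: combines a UTF-16 surrogate
// pair into one code point and appends its 4-byte UTF-8 encoding to out.
// Returns false (and leaves out unchanged) if hi/lo are not a valid pair.
bool append_surrogate_pair_as_utf8(char16_t hi, char16_t lo, std::string& out) {
  if (hi < 0xD800 || hi > 0xDBFF) return false;  // not a high surrogate
  if (lo < 0xDC00 || lo > 0xDFFF) return false;  // not a low surrogate
  const char32_t cp = 0x10000 +
                      ((static_cast<char32_t>(hi) - 0xD800) << 10) +
                      (static_cast<char32_t>(lo) - 0xDC00);
  out.push_back(static_cast<char>(0xF0 | (cp >> 18)));           // 11110xxx
  out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));  // 10xxxxxx
  out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));   // 10xxxxxx
  out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));          // 10xxxxxx
  return true;
}
```

Because a valid surrogate pair always yields a code point between U+10000 and U+10FFFF, the result is always exactly four bytes with a lead byte in the range 0xF0 to 0xF4.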

Kovács Viktor and Tisza István developers · Answer 5 · Wed Oct 05 2022 23:05:52 GMT+0800 (China Standard Time)

@caolanm I'm working on it.

Kovács Viktor and Tisza István developers · Answer 6 · Wed Oct 19 2022 23:47:34 GMT+0800 (China Standard Time)

@caolanm Unicode 15.0 includes an Arabic extension block that lies above code point 0xFFFF.
Although I promised to work on this problem, I cannot. I have a lot of other work.