Conversion of 4-byte UTF-8 to UTF-16 surrogate pairs and back is not implemented
rovasiras opened this issue · comments
The UTF-8 to UTF-16 converter function cannot convert 4-byte UTF-8 sequences to two UTF-16 surrogates, or back. This could be resolved with a
4-byte UTF-8 -> UTF-32 -> two UTF-16 surrogates conversion. See: Unicode FAQ about "UTF-8, UTF-16, UTF-32".
The back conversion can be implemented with the same logic. The first byte of a 4-byte UTF-8 sequence is in the range 0xF0 to 0xF7 (0xF0 to 0xF4 for valid Unicode code points).
This is needed for languages/scripts whose code points are above 0xFFFF, for example, I think, the Old Hungarian script.
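A minimal sketch of the forward direction described above (4-byte UTF-8 -> UTF-32 -> two UTF-16 surrogates). The function name `utf8_4byte_to_surrogates` and its signature are hypothetical and are not taken from hunspell's `csutil.cxx`; the sketch handles a single 4-byte sequence only.

```cpp
#include <cstdint>

// Hypothetical helper, not part of hunspell: decodes one 4-byte UTF-8
// sequence at s[0..3] and writes the matching UTF-16 surrogate pair.
// Returns false for malformed input.
bool utf8_4byte_to_surrogates(const unsigned char* s,
                              char16_t& high, char16_t& low) {
  // The lead byte must be 11110xxx, each continuation byte 10xxxxxx.
  if ((s[0] & 0xF8) != 0xF0) return false;
  for (int i = 1; i < 4; ++i) {
    if ((s[i] & 0xC0) != 0x80) return false;
  }

  // Step 1: UTF-8 -> UTF-32 (3 + 6 + 6 + 6 payload bits).
  std::uint32_t cp = (static_cast<std::uint32_t>(s[0] & 0x07) << 18) |
                     (static_cast<std::uint32_t>(s[1] & 0x3F) << 12) |
                     (static_cast<std::uint32_t>(s[2] & 0x3F) << 6) |
                     static_cast<std::uint32_t>(s[3] & 0x3F);

  // Reject overlong encodings and values beyond the Unicode range.
  if (cp < 0x10000 || cp > 0x10FFFF) return false;

  // Step 2: UTF-32 -> UTF-16 surrogate pair.
  cp -= 0x10000;
  high = static_cast<char16_t>(0xD800 + (cp >> 10));
  low = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
  return true;
}
```

For example, U+10C80 (the first code point of the Old Hungarian block) is the UTF-8 byte sequence `F0 90 B2 80` and would come out as the surrogate pair `D803 DC80`.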
@caolanm what do you think?
Theoretically, C++11 provides such conversions in the <codecvt>
header, and WinAPI provides them in the functions MultiByteToWideChar/WideCharToMultiByte. But <codecvt>
is deprecated in C++17, and WinAPI is available only on Windows and requires two passes over the string.
@cuellius I know that well. I was writing about the u16_u8 and u8_u16 functions in hunspell/csutil.cxx!
@caolanm In the hunspell/csutil.cxx source code, comments in the u8_u16 and u16_u8 functions say that conversion of 4-byte UTF-8 sequences is not implemented yet.
I think the algorithm in #851 (comment) could be useful.
See: Unicode FAQ about "UTF-8, UTF-16, UTF-32"
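A minimal sketch of the back conversion (two UTF-16 surrogates -> UTF-32 -> 4-byte UTF-8), following the arithmetic in the Unicode FAQ. The function name `surrogates_to_utf8_4byte` is hypothetical; this is not hunspell's `u16_u8`, only an illustration of the missing case.

```cpp
#include <cstdint>

// Hypothetical helper, not hunspell's u16_u8: encodes one UTF-16 surrogate
// pair as a 4-byte UTF-8 sequence written to out[0..3].
// Returns false if the inputs are not a valid high/low surrogate pair.
bool surrogates_to_utf8_4byte(char16_t high, char16_t low,
                              unsigned char* out) {
  if (high < 0xD800 || high > 0xDBFF) return false;
  if (low < 0xDC00 || low > 0xDFFF) return false;

  // Step 1: surrogate pair -> UTF-32 code point (0x10000..0x10FFFF).
  std::uint32_t cp = 0x10000 +
                     ((static_cast<std::uint32_t>(high) - 0xD800) << 10) +
                     (static_cast<std::uint32_t>(low) - 0xDC00);

  // Step 2: UTF-32 -> UTF-8, lead byte 11110xxx plus three 10xxxxxx bytes.
  out[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
  out[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
  out[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
  out[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
  return true;
}
```

With the pair `D803 DC80` this would write the bytes `F0 90 B2 80`, i.e. U+10C80. Because a valid surrogate pair can never exceed U+10FFFF, the lead byte produced here always falls in 0xF0 to 0xF4.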
@caolanm I'm working on it.
@caolanm Version 15.0 of the Unicode standard contains an Arabic extension block that is located above 0xFFFF.
Although I promised to work on this problem, I cannot. I have a lot of other work.