hunspell / hunspell

The most popular spellchecking library.

Home Page: http://hunspell.github.io/

Converting 4-byte UTF-8 to a UTF-16 surrogate pair and back is not implemented

rovasiras opened this issue · comments

The UTF-8 to UTF-16 converter functions cannot convert 4-byte UTF-8 sequences to UTF-16 surrogate pairs, or back. This could be resolved with a
two-step conversion: 4-byte UTF-8 -> UTF-32 code point -> two UTF-16 surrogates. See the Unicode FAQ on "UTF-8, UTF-16, UTF-32".
The reverse conversion can be implemented with the same logic. The first byte of a 4-byte UTF-8 sequence is in the range 0xF0 to 0xF7.
This is needed for languages and scripts encoded above 0xFFFF, for example, I think, the Old Hungarian script.
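
A minimal sketch of the forward conversion described above, assuming well-formed input. The function names `decode_utf8_4byte` and `to_surrogates` are hypothetical and are not taken from Hunspell's source; this only illustrates the UTF-8 -> UTF-32 -> surrogate pair arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper (not part of Hunspell): decode one 4-byte UTF-8
// sequence starting at s[pos] into a UTF-32 code point.
// Returns false if the bytes do not form a valid 4-byte sequence.
bool decode_utf8_4byte(const std::string& s, std::size_t pos, char32_t& cp) {
    if (pos + 4 > s.size()) return false;
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    const unsigned char b1 = static_cast<unsigned char>(s[pos + 1]);
    const unsigned char b2 = static_cast<unsigned char>(s[pos + 2]);
    const unsigned char b3 = static_cast<unsigned char>(s[pos + 3]);
    if ((b0 & 0xF8) != 0xF0) return false;  // lead byte must be 11110xxx
    if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80 || (b3 & 0xC0) != 0x80)
        return false;                       // continuation bytes must be 10xxxxxx
    cp = (static_cast<char32_t>(b0 & 0x07) << 18) |
         (static_cast<char32_t>(b1 & 0x3F) << 12) |
         (static_cast<char32_t>(b2 & 0x3F) << 6) |
         static_cast<char32_t>(b3 & 0x3F);
    // Reject overlong encodings and values beyond the Unicode range.
    return cp >= 0x10000 && cp <= 0x10FFFF;
}

// Hypothetical helper: split a code point above 0xFFFF into a
// (high, low) UTF-16 surrogate pair.
std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const char32_t v = cp - 0x10000;  // 20 significant bits remain
    return {static_cast<char16_t>(0xD800 + (v >> 10)),     // top 10 bits
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))};  // bottom 10 bits
}
```

For example, the bytes F0 90 B2 80 decode to U+10C80 (in the Old Hungarian block), which splits into the surrogates 0xD803 and 0xDC80.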

commented

In theory, C++11 provides such conversions in the <codecvt> header, and WinAPI provides them in the functions MultiByteToWideChar/WideCharToMultiByte. But <codecvt> is deprecated in C++17, and WinAPI is available only on Windows and requires two passes over the string.

@cuellius I know that well. I was writing about the u16_u8 and u8_u16 functions in hunspell/csutil.cxx!

@caolanm In the hunspell/csutil.cxx source code, comments in the u8_u16 and u16_u8 functions say that conversion of 4-byte UTF-8 sequences is not implemented yet.
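
For the UTF-16 to UTF-8 direction, a surrogate pair can be recombined into a code point and written out as four bytes. The sketch below is only an illustration of that arithmetic; `append_surrogate_pair_as_utf8` is a hypothetical name and does not reflect the actual signature or internals of u16_u8.

```cpp
#include <string>

// Hypothetical helper (not part of Hunspell): combine a UTF-16 surrogate
// pair into one code point and append its 4-byte UTF-8 encoding to `out`.
// Returns false if (hi, lo) is not a valid high/low surrogate pair.
bool append_surrogate_pair_as_utf8(char16_t hi, char16_t lo, std::string& out) {
    if (hi < 0xD800 || hi > 0xDBFF) return false;  // not a high surrogate
    if (lo < 0xDC00 || lo > 0xDFFF) return false;  // not a low surrogate
    const char32_t cp = 0x10000 +
                        ((static_cast<char32_t>(hi) - 0xD800) << 10) +
                        (static_cast<char32_t>(lo) - 0xDC00);
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));           // 11110xxx
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));  // 10xxxxxx
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));   // 10xxxxxx
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));          // 10xxxxxx
    return true;
}
```

With the pair 0xD803, 0xDC80 this appends the bytes F0 90 B2 80, the UTF-8 form of U+10C80.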

I think the algorithm in #851 (comment) could be useful.
See the Unicode FAQ on "UTF-8, UTF-16, UTF-32".

@caolanm Version 15.0 of the Unicode Standard contains an Arabic extension whose block lies above 0xFFFF.
Although I promised to work on this problem, I cannot. I have a lot of other work.