ICU for Unicode handling?

shermp opened this issue 6 years ago

I'm mulling over the idea of adding basic freetype2 support. While looking through the FBInk codebase to figure out how to add it, I noticed your rant on Kobo's broken libc with regard to Unicode support.

I notice that the Kobo firmware appears to include the ICU library (libicu*.so, vers. 4.6). Have you looked into using this library for dealing with strings in FBInk?

The API documentation for ICU 4.6.1 is here

NiLuJe · Answer 1 · Sat Aug 18 2018 12:55:16 GMT+0800 (China Standard Time)

I ended up skirting the issue with libu8, and, provided no-one tries to feed us hopelessly broken encoding, that does the job just fine without having to massively rework how strings are handled ;).

(ICU is a very very large hammer to take care of the Unicode issue, and the fact that wchar_t is just hopelessly broken on Kobo probably doesn't help. Plus, the fact that some of our target devices either don't ship it, or ship wildly different versions is another thing against it, because bundling it is not an option: besides the fact that it's C++, and takes forever to build, libicudata is over 25MB in ICU 60.2 ;)).

Sherman Perry · Answer 2 · Sat Aug 18 2018 13:20:52 GMT+0800 (China Standard Time)

Ah, fair enough. Carry on...

/me keeps forgetting kindles exist 😈

I read a blog post a while back, where the author advocated using UTF-8 internally, and therefore sticking with the standard char * data type. The author argued that many of the most common string operations only care about bytes, and not characters. Also, UTF-8 is a sequence of bytes, so endianness doesn't matter. I found it a rather fascinating read.
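
To illustrate the byte-versus-character point, here is a minimal sketch in C. It is not taken from FBInk or from the blog post; `utf8_codepoints` is a hypothetical helper, and it assumes the input is already valid UTF-8. Plain `strlen` and `strstr` keep working on the raw bytes, and only the character count needs to know about the encoding.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string.
 * Every byte that is NOT a continuation byte (10xxxxxx) starts a new
 * code point. Assumes the input is valid UTF-8; no validation is done. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

For example, for the string "h\xC3\xA9llo" ("héllo"), `strlen` reports 6 bytes while `utf8_codepoints` reports 5 code points, and `strstr` still finds "llo" without knowing anything about UTF-8.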

NiLuJe · Answer 3 · Sat Aug 18 2018 13:49:53 GMT+0800 (China Standard Time)

That's essentially what I ended up going with ;).

I think I may have read that very same article (if it mentioned doing sanitization/conversions at I/O boundaries, that's the one). But with the hobbled libc, I can't really do the sanitization/conversion bit, since any libc-based locale/multibyte/widechar stuff is basically borked ;).
So I'm just skipping that, and hoping really hard no-one will feed us KOI8-R or something xD.
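
For reference, a validity check at an I/O boundary does not strictly need the libc locale machinery. The following is only a sketch under that assumption, not something the thread says FBInk or libu8 does; `utf8_is_valid` is a hypothetical name. It walks the bytes by hand and rejects truncated sequences, overlong encodings, surrogates, and values above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true if s[0..len) is well-formed UTF-8.
 * Uses no locale, multibyte, or wchar_t functions from libc. */
static bool utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need;       /* number of continuation bytes expected */
        unsigned long cp;  /* code point being decoded */

        if (c < 0x80) {
            i++;           /* plain ASCII */
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07;
        } else {
            return false;  /* stray continuation byte or 0xF8..0xFF */
        }

        if (len - i - 1 < need) {
            return false;  /* sequence truncated at end of input */
        }
        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) {
                return false;  /* not a continuation byte */
            }
            cp = (cp << 6) | (cc & 0x3F);
        }

        /* Reject overlong encodings, UTF-16 surrogates, and out-of-range values. */
        if ((need == 1 && cp < 0x80) ||
            (need == 2 && cp < 0x800) ||
            (need == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF) {
            return false;
        }
        i += need + 1;
    }
    return true;
}
```

Typical KOI8-R text trips a check like this quickly, because its high bytes rarely line up as lead bytes followed by continuation bytes. A check of this kind only detects bad input; it does not convert anything.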

Sherman Perry · Answer 4 · Sat Aug 18 2018 14:01:14 GMT+0800 (China Standard Time)

It probably was the same article :p

I've been looking into this area a bit lately, because I'm trying to see if I can add differential support to my VHD library, and file path strings there are encoded as UTF-16BE.
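
As a rough illustration of what handling such strings involves, here is a sketch of a UTF-16BE to UTF-8 converter in C. It is not from the VHD library mentioned above; `utf16be_to_utf8` and its signature are assumptions made for this example. It reads big-endian 16-bit units, combines surrogate pairs, and writes UTF-8 bytes without NUL-terminating the output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert in[0..in_len) from UTF-16BE to UTF-8.
 * Returns the number of bytes written to out, or (size_t)-1 on malformed
 * input or if out_cap is too small. The output is NOT NUL-terminated. */
static size_t utf16be_to_utf8(const uint8_t *in, size_t in_len,
                              char *out, size_t out_cap)
{
    size_t o = 0;
    if (in_len % 2 != 0) {
        return (size_t)-1;  /* UTF-16 comes in 2-byte units */
    }
    for (size_t i = 0; i < in_len; i += 2) {
        /* Big-endian: most significant byte first. */
        uint32_t cp = ((uint32_t)in[i] << 8) | in[i + 1];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (in_len - i < 4) {
                return (size_t)-1;
            }
            uint32_t lo = ((uint32_t)in[i + 2] << 8) | in[i + 3];
            if (lo < 0xDC00 || lo > 0xDFFF) {
                return (size_t)-1;
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i += 2;  /* consumed the extra unit */
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;  /* lone low surrogate */
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (out_cap - o < need) {
            return (size_t)-1;  /* output buffer too small */
        }
        if (need == 1) {
            out[o++] = (char)cp;
        } else if (need == 2) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

Because the byte order is handled explicitly when each unit is read, the same code behaves identically on little-endian and big-endian hosts.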

Incidentally, do you know of any good cross-platform C file path library?

NiLuJe · Answer 5 · Sat Aug 18 2018 14:10:46 GMT+0800 (China Standard Time)

Not really, the only thing that comes to mind is C++ (namely, boost) :/.

NiLuJe · Answer 6 · Sat Aug 18 2018 14:12:10 GMT+0800 (China Standard Time)

And I really don't want to say glib on the C side of things, because glib's weird, and I'm not even sure it'd do what you need ;).

NiLuJe · Answer 7 · Sat Aug 18 2018 15:17:29 GMT+0800 (China Standard Time)

You might also find something interesting either in stb or some other small libs like that ;).

Sherman Perry · Answer 8 · Sat Aug 18 2018 20:37:42 GMT+0800 (China Standard Time)

Thanks for the suggestions. I didn't see anything that really struck me as being suitable for my requirements (simple though they may be; path joining and normalization).
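
For a sense of scale, the joining half of that requirement can be sketched in a few lines of C. This is only an illustration, not code from stb or any of the libraries mentioned; `path_join` is a hypothetical helper that assumes '/' as the only separator and does no normalization of "." or ".." components.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: join two POSIX-style path fragments with exactly
 * one '/' between them. Returns 0 on success, or -1 if the result
 * (including the terminating NUL) does not fit in out[0..cap). */
static int path_join(char *out, size_t cap, const char *a, const char *b)
{
    size_t alen = strlen(a);

    /* Drop trailing slashes from a, but keep a lone "/" root. */
    while (alen > 1 && a[alen - 1] == '/') {
        alen--;
    }
    /* Drop leading slashes from b so the separator is not doubled. */
    while (*b == '/') {
        b++;
    }
    /* No separator needed if either side is empty or a is the root. */
    const char *sep =
        (alen == 0 || *b == '\0' || a[alen - 1] == '/') ? "" : "/";

    int n = snprintf(out, cap, "%.*s%s%s", (int)alen, a, sep, b);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

A cross-platform version would also have to deal with Windows drive letters, backslashes, and UNC paths, which is where most of the real work lies.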

Sherman Perry · Answer 9 · Fri Aug 24 2018 11:10:28 GMT+0800 (China Standard Time)

I had another look at that STB link, and noticed I had missed the stb.h file the first time around.

Oh my... that looks just about perfect :)