Add optional UTF-8 Display/File character support.

Question

taviso opened this issue 2 years ago

Lotus 1-2-3 predates UTF-8 and uses LMBCS internally, which is sort of a precursor to Unicode.

I see no reason we couldn't add a UTF-8 option for the file/display charset, for better i18n support. 1-2-3 already supports character set translation; we just have to teach it UTF-8 and figure out the CBD (character bundle) format. I already know the BDLREC format from my lotusdrv project: it's basically a TLV (tag, length, value) encoding system.
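
For readers unfamiliar with TLV, here is a minimal sketch of walking a buffer of tag/length/value records. The field widths (16-bit little-endian tag and length) and the names `struct tlv` and `tlv_next` are assumptions made for illustration; the thread does not give the actual BDLREC layout.

```c
#include <stddef.h>
#include <stdint.h>

/* One decoded record. The 16-bit little-endian tag and length fields are
 * an assumption for illustration, not the documented BDLREC layout. */
struct tlv {
    uint16_t tag;
    uint16_t len;
    const uint8_t *value;   /* points into the caller's buffer */
};

/* Decode the record starting at buf[*off] and advance *off past it.
 * Returns 1 on success, 0 if the buffer is exhausted or truncated. */
int tlv_next(const uint8_t *buf, size_t size, size_t *off, struct tlv *out)
{
    if (*off > size || size - *off < 4)
        return 0;                       /* no room for a header */

    const uint8_t *p = buf + *off;
    out->tag = (uint16_t)(p[0] | (p[1] << 8));
    out->len = (uint16_t)(p[2] | (p[3] << 8));

    if (size - *off - 4 < out->len)
        return 0;                       /* value would run past the end */

    out->value = p + 4;
    *off += 4 + (size_t)out->len;
    return 1;
}
```

A caller would start with `off = 0` and call `tlv_next()` in a loop until it returns 0, dispatching on `tag` for each record.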

sjuswede · Answer 1 · Thu Jun 16 2022 04:18:36 GMT+0800 (China Standard Time)

I sometimes work with Chinese and Japanese text, and if this could get working, I'd be extremely happy.

Right now not even Swedish characters like åäö work for me in rxvt-unicode or XTerm when I try to enter them. When I import a CSV containing them (in UTF-8) they predictably get stripped out.

Tavis Ormandy · Answer 2 · Thu Jun 16 2022 08:36:19 GMT+0800 (China Standard Time)

Yeah, not great right now, I can't even use £ lol. I looked at the code a bit today, I think I can make a few improvements easily, some might be harder though!

Internally, Lotus uses LMBCS, which is actually pretty impressive foresight considering Unicode hadn't been invented yet and everyone else was using codepages. This is good, because internally it can tell the difference between åäa.

You can see it knows about å, and calls it a ring:

https://archive.org/details/lotus-1-2-3-release-3.1-reference/Lotus%201-2-3%20Release%203.1%20-%20Reference/page/n637/mode/2up

It stores these characters correctly but doesn't know how to display them, so right now it uses a "fallback" ASCII character translation table (å => a and £ => L, and so on). That actually seems pretty easy to solve: I'll just add an LMBCS => UTF-8 table, then pass the result to waddch() instead.
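
A rough sketch of what such a display table could look like. Everything here is hypothetical: the names `display_table`, `utf8_encode` and `lmbcs_to_utf8` are made up, and the byte values are the code page 850 positions, assumed (not confirmed in this thread) to match the single-byte LMBCS codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative entries only. The byte values follow code page 850 and are
 * assumed to match the single-byte LMBCS codes; verify before relying on them. */
struct lmbcs_map { uint8_t lmbcs; uint32_t ucs; };

static const struct lmbcs_map display_table[] = {
    { 0x84, 0x00E4 },   /* ä */
    { 0x86, 0x00E5 },   /* å */
    { 0x94, 0x00F6 },   /* ö */
    { 0x9C, 0x00A3 },   /* £ */
};

/* Encode one code point as UTF-8. out must hold at least 5 bytes.
 * Returns the number of bytes written, not counting the trailing NUL. */
size_t utf8_encode(uint32_t cp, char *out)
{
    size_t n;
    if (cp < 0x80) {
        out[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    }
    out[n] = '\0';
    return n;
}

/* Translate one single-byte code to a UTF-8 string.
 * ASCII passes through; unmapped high bytes become '?'. */
size_t lmbcs_to_utf8(uint8_t c, char *out)
{
    if (c < 0x80)
        return utf8_encode(c, out);
    for (size_t i = 0; i < sizeof display_table / sizeof display_table[0]; i++)
        if (display_table[i].lmbcs == c)
            return utf8_encode(display_table[i].ucs, out);
    return utf8_encode('?', out);
}
```

One caveat: waddch() takes a single chtype, so a multibyte string like this would more likely be printed with waddstr() from a wide-character ncurses build running in a UTF-8 locale. That part is not shown here.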

I'll give it a shot this weekend.

Tavis Ormandy · Answer 3 · Thu Jun 16 2022 09:18:16 GMT+0800 (China Standard Time)

I think display and keyboard input might be easy, but the question is what to do with /File Import: always assume UTF-8? I guess we could have an environment variable like $LOTUS_IMPORT_CHARSET or whatever.

sjuswede · Answer 4 · Fri Jun 17 2022 13:51:46 GMT+0800 (China Standard Time)

An environment variable would of course be great for legacy files. I would default to UTF-8, since that is standard in Linux today. It's a lot of work to set a normal distro to use anything else. But there are a lot of legacy files out there, and many systems which still spit out very strange formats. Don't ask me how I know.
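
A minimal sketch of that lookup, assuming the variable really is called LOTUS_IMPORT_CHARSET (the name was only floated above) and that UTF-8 is the default when it is unset or empty. The function names are hypothetical.

```c
#include <stdlib.h>

/* Pick the import charset from a raw environment value.
 * Falls back to "UTF-8" when the value is missing or empty. */
const char *resolve_import_charset(const char *env_value)
{
    if (env_value == NULL || env_value[0] == '\0')
        return "UTF-8";
    return env_value;
}

/* Convenience wrapper. LOTUS_IMPORT_CHARSET is the name suggested in
 * this thread, not an existing setting. */
const char *import_charset(void)
{
    return resolve_import_charset(getenv("LOTUS_IMPORT_CHARSET"));
}
```

Splitting the decision from the getenv() call keeps the fallback rule easy to test without touching the process environment.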

Tavis Ormandy · Answer 5 · Tue Jun 21 2022 07:14:08 GMT+0800 (China Standard Time)

Okay, I think I've got a plan. I have an easy temporary improvement, and a plan for a harder complete solution.

I can change the keymap code to translate UTF-8 on input to all the supported LMBCS characters. There are no collisions (I checked), so this will be super easy. I can do this in a day or two.

This is easy but not a complete solution -- there's no CJK for a start... but it is better than nothing - most of the Latin extended characters are covered (so I'll get £, you'll get all the Swedish characters, things like éßçñ are all there). There is no €, but it has ¤, so it seems pretty safe to just steal that for € for now? I don't know.
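
The input side could be sketched roughly like this: decode one UTF-8 sequence from the keyboard, then map the code point to a single-byte code. The names `utf8_decode` and `ucs_to_lmbcs` are hypothetical, the byte values are code page 850 positions assumed to match LMBCS, and the € => ¤ entry is only the stopgap floated above.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available).
 * Stores the code point in *cp and returns the bytes consumed, or 0 if
 * the input is truncated or malformed. Overlong forms are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        *cp = s[0] & 0x1F; len = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {
        *cp = s[0] & 0x0F; len = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {
        *cp = s[0] & 0x07; len = 4;
    } else {
        return 0;                       /* stray continuation or invalid lead */
    }
    if (n < len)
        return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Map a code point to a single-byte code. Returns 0 when there is no
 * mapping (for example CJK). Values are CP850 positions, assumed to
 * match LMBCS for illustration. */
uint8_t ucs_to_lmbcs(uint32_t cp)
{
    if (cp < 0x80)
        return (uint8_t)cp;
    switch (cp) {
    case 0x00E4: return 0x84;   /* ä */
    case 0x00E5: return 0x86;   /* å */
    case 0x00F6: return 0x94;   /* ö */
    case 0x00A3: return 0x9C;   /* £ */
    case 0x00A4: return 0xCF;   /* ¤ */
    case 0x20AC: return 0xCF;   /* € borrows the ¤ slot as a stopgap */
    default:     return 0;
    }
}
```

Note that once € shares a code with ¤ the mapping is no longer reversible, so a file saved this way would show ¤ wherever € was typed.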

The complete solution will be adding LMBCS<->UTF-8 charset support, but this is a much bigger job.

krackout · Answer 6 · Tue Jan 03 2023 18:32:59 GMT+0800 (China Standard Time)

> The complete solution will be adding LMBCS<->UTF-8 charset support, but this is a much bigger job.

@taviso If I can help in any way, I'm willing, especially with Greek. Getting me involved in the programming might be a waste of time, but I could more easily help with the conversion tables, I suppose.

Tavis Ormandy · Answer 7 · Thu Jan 05 2023 03:07:12 GMT+0800 (China Standard Time)

Thank you! I'm slowly working on this, it will work eventually! 😆