Text segmentation for TS port

Question

afontenot opened this issue 3 years ago

Hi, thanks for your work on the QR code generator, it's great!

While I already have a Linux native QR code tool that I use, I often turn to yours on the web when I need to quickly make a QR code or when I'm making a recommendation to someone else for what they should use. However, the web demo is a bit limited by the fact that it doesn't contain the text segmentation algorithm of the Java version. Missing this feature can make a significant difference in efficiency, for example when encoding SMART Health Cards. These look like shc:/[~1500 digits].

I noticed you already have a TypeScript implementation of your algorithm. I was considering forking your TypeScript QR generator and adding the text segmentation algorithm, but then I realized that the TypeScript implementation is not under a free license, so the result would not be distributable.

I wonder if you would consider adding this feature to the TS QR code generator. I could of course reimplement it myself working off the (freely licensed) Java version, but that seems a bit pointless when you've already done the work of writing it in TS and you might be willing to add it.

(I'd also appreciate a setting in the demo allowing the user to choose which encoding mode to use, but that's not necessary for the above feature to be useful.)

Nayuki · Answer 1 · Sat Jan 15 2022 04:41:16 GMT+0800 (China Standard Time)

Thanks for your detailed write-up! You covered many observations that are relevant to the topic, and by and large I agree with you.

You're right, my main web page doesn't implement the optimal text segmentation algorithm. I wrote the optimal algorithm with no audience in mind (actually, I wrote this whole library with no audience in mind), but over the past ~2 years I have heard more buzz about SMART Health Cards needing custom segmentation.

My implementation of optimal segmentation on a web page is non-free because it is intended solely as a demo and nothing more. Half of the code is entangled in UI considerations rather than the core algorithm. Attempting to adapt the code to work with the existing qrcodegen library would require careful study and editing to ensure that it agrees with technical details elsewhere in the library and that it is correct for all possible inputs. (If you do some deep searching, other people have ported my optimal segmentation algorithm to other languages... I think Rust, and maybe one or two more.)

In the short term, I'm thinking of removing QrCode.encodeBinary() because I don't think anyone just dumps raw bytes into a QR Code. If they do want to do it, they can make a custom segment with a few more lines of code.

Currently, QrCode.encodeText() calls QrSegment.makeSegments(), which analyzes the whole string to pick a single mode. Ideally, makeSegments() should not exist, and makeSegmentsOptimally() should be the only automatic segmenter available.
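
To make the "single mode for the whole string" behaviour concrete, here is a minimal standalone sketch. It is not the library's code: `Mode` and `pickSingleMode` are hypothetical names, and the only thing taken from the QR Code standard is the character set of each mode.

```typescript
type Mode = "numeric" | "alphanumeric" | "byte";

// Character sets of the two compact modes in the QR Code standard.
const NUMERIC_RE = /^[0-9]*$/;
const ALPHANUMERIC_RE = /^[A-Z0-9 $%*+.\/:-]*$/;

// Hypothetical helper: inspects the whole string and returns the most
// compact single mode that can represent every character in it.
function pickSingleMode(text: string): Mode {
  if (NUMERIC_RE.test(text)) return "numeric";
  if (ALPHANUMERIC_RE.test(text)) return "alphanumeric";
  return "byte";
}

console.log(pickSingleMode("0123"));
console.log(pickSingleMode("SHC:/0123"));
console.log(pickSingleMode("shc:/0123"));
```

One lowercase letter is enough to push the entire string into byte mode, which is why a long run of digits after a lowercase prefix gets encoded at 8 bits per digit.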

But I'm apprehensive about optimal segmentation for a couple of reasons:

- Optimal segmentation requires ~150 lines of code whereas simple mode detection requires ~10 lines.
- The dynamic programming algorithm for optimal segmentation is not easy to understand or audit for correctness - both in terms of optimality and stuff like not losing or corrupting single characters.
  - However, this library already has a bunch of necessarily non-obvious things like finite field arithmetic, polynomial division for Reed-Solomon encoding, the zigzag scan, etc.
- The kanji-mode encoder needs a big lookup table to translate between Unicode and Shift JIS. I implemented this to demonstrate it is feasible, but I don't think it is a popular or useful feature to justify the code size. I think doing optimal segmentation among {byte, alphanumeric, numeric} modes is still reasonable.
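
As a rough idea of what dynamic programming over {byte, alphanumeric, numeric} involves, here is a simplified sketch. It is not a port of the Java implementation. All names are hypothetical, the input is assumed to be ASCII (so one character is one byte), and the character-count field widths are fixed to those of QR versions 1 to 9. Costs are tracked in sixths of a bit so that 5.5 bits per alphanumeric character and 10/3 bits per digit stay integral, and a segment's cost is rounded up to a whole bit when it ends.

```typescript
type Mode = "byte" | "alphanumeric" | "numeric";
interface Segment { mode: Mode; text: string; }

const MODES: Mode[] = ["byte", "alphanumeric", "numeric"];
// Payload cost per character in sixths of a bit: 8, 5.5, and 10/3 bits.
const CHAR_COST = [48, 33, 20];
// Mode indicator (4 bits) plus character-count field, QR versions 1 to 9 only.
const HEADER_BITS = [4 + 8, 4 + 9, 4 + 10];
const ALNUM_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:";

function canEncode(modeIndex: number, ch: string): boolean {
  if (modeIndex === 2) return ch >= "0" && ch <= "9";
  if (modeIndex === 1) return ALNUM_CHARS.indexOf(ch) >= 0;
  return true; // byte mode accepts anything (ASCII is checked by the caller)
}

function segmentSketch(text: string): { segments: Segment[]; totalBits: number } {
  const n = text.length;
  if (n === 0) return { segments: [], totalBits: 0 };
  for (let i = 0; i < n; i++) {
    if (text.charCodeAt(i) > 127) throw new RangeError("sketch handles ASCII only");
  }
  // cost[i][m]: cheapest encoding of text[0..i] with character i in mode m.
  // prev[i][m]: mode of character i-1 on that cheapest path (-1 at i = 0).
  const cost: number[][] = [];
  const prev: number[][] = [];
  for (let i = 0; i < n; i++) {
    cost.push([Infinity, Infinity, Infinity]);
    prev.push([-1, -1, -1]);
    for (let m = 0; m < 3; m++) {
      if (!canEncode(m, text[i])) continue;
      const header = HEADER_BITS[m] * 6;
      if (i === 0) { cost[i][m] = header + CHAR_COST[m]; continue; }
      for (let p = 0; p < 3; p++) {
        const base = cost[i - 1][p];
        // Same mode: extend the segment. Different mode: close the old
        // segment (round up to a whole bit) and pay for a new header.
        const c = p === m
          ? base + CHAR_COST[m]
          : Math.ceil(base / 6) * 6 + header + CHAR_COST[m];
        if (c < cost[i][m]) { cost[i][m] = c; prev[i][m] = p; }
      }
    }
  }
  // Pick the cheapest final mode, then walk the prev links backwards.
  let m = 0;
  for (let k = 1; k < 3; k++) if (cost[n - 1][k] < cost[n - 1][m]) m = k;
  const totalBits = Math.ceil(cost[n - 1][m] / 6);
  const modeAt = new Array<number>(n);
  for (let i = n - 1; i >= 0; i--) { modeAt[i] = m; m = prev[i][m]; }
  // Merge runs of equal modes into segments.
  const segments: Segment[] = [];
  for (let i = 0; i < n; i++) {
    const last = segments[segments.length - 1];
    if (last !== undefined && last.mode === MODES[modeAt[i]]) last.text += text[i];
    else segments.push({ mode: MODES[modeAt[i]], text: text[i] });
  }
  return { segments, totalBits };
}

console.log(segmentSketch("shc:/01234567890123456789"));
```

A real port would also have to pick the field widths from the target version, respect each count field's maximum, handle non-ASCII text, and be checked against the Java output, which is where most of the audit burden mentioned above comes from.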

If you desperately need segmentation in the short term, then either do it manually for the specific application (e.g. Alphanumeric "SHC:/" + Numeric "0123"), or take on the full burden and risk of porting my Java algorithm.
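
To see when a manual split like this pays off, the bit cost of each plan can be worked out by hand. The sketch below does only that arithmetic and does not call the library. `SegPlan`, `segmentBits`, and `planBits` are hypothetical names, and the version argument is used solely to select the character-count field width (version 20 is an arbitrary pick within the 10 to 26 range).

```typescript
type Mode = "numeric" | "alphanumeric" | "byte";
interface SegPlan { mode: Mode; length: number; }

// Width of the character-count field for a mode at a given QR version.
function countFieldBits(mode: Mode, version: number): number {
  const tier = version <= 9 ? 0 : version <= 26 ? 1 : 2;
  const table: Record<Mode, number[]> = {
    numeric: [10, 12, 14],
    alphanumeric: [9, 11, 13],
    byte: [8, 16, 16],
  };
  return table[mode][tier];
}

// Bits for one segment: 4-bit mode indicator + count field + payload.
function segmentBits(seg: SegPlan, version: number): number {
  let payload: number;
  if (seg.mode === "numeric") {
    // 10 bits per 3 digits, then 4 or 7 bits for 1 or 2 leftover digits.
    payload = Math.floor(seg.length / 3) * 10 + [0, 4, 7][seg.length % 3];
  } else if (seg.mode === "alphanumeric") {
    // 11 bits per 2 characters, then 6 bits for a leftover character.
    payload = Math.floor(seg.length / 2) * 11 + (seg.length % 2) * 6;
  } else {
    payload = seg.length * 8;
  }
  return 4 + countFieldBits(seg.mode, version) + payload;
}

function planBits(plan: SegPlan[], version: number): number {
  return plan.reduce((sum, seg) => sum + segmentBits(seg, version), 0);
}

// "SHC:/" followed by 1500 digits.
const split = planBits(
  [{ mode: "alphanumeric", length: 5 }, { mode: "numeric", length: 1500 }], 20);
const oneAlphanumeric = planBits([{ mode: "alphanumeric", length: 1505 }], 20);
const oneByte = planBits([{ mode: "byte", length: 1505 }], 20);
console.log(split, oneAlphanumeric, oneByte); // 5059 8293 12060
```

With only four digits, as in "SHC:/" + "0123" at version 1, the same arithmetic gives 69 bits for the split against 63 for a single alphanumeric segment, so the extra header only pays for itself once the digit run is long enough.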

As time permits, I'll try porting the optimal segmentation algorithm to all my languages (except "c" and "rust-no-heap" due to no-heap), and stabilize the functionality/API/documentation. Be warned that this could take months.