Run UTN11 normalization on Myanmar text?

Question

brawer opened this issue 7 years ago

When shaping Myanmar text, could HarfBuzz run UTN 11 normalization similar to how NFC/NFD are handled? @mjansche wrote a Python implementation.

For example, the current version of the Universal Declaration of Human Rights in Mon contains the string သွက်ဂွံဗၠးၜးတိ်တ်, which is not in normalized form. Translators entered it that way anyway, i.e. Myanmar text is floating around in non-normalized form. The normalized form would be သွက်ဂွံဗၠးၜးတ်ိတ်. Here’s an image of how HarfBuzz renders the translator-supplied form (left) and the normalized form (right) with the current draft of NotoSansMyanmar; with SIL Padauk, the image looks very similar.

If HarfBuzz could normalize Myanmar strings before shaping, text rendering might get more robust? But it might be too expensive? Or maybe one could only run the normalization algorithm when shaping the original string has produced .notdef glyphs? It smells like a problem for which Ragel could work? (@mjansche: HarfBuzz currently uses Ragel for other finite-state machines).

Roozbeh Pournader · Answer 1 · Sat Jun 10 2017 01:25:00 GMT+0800 (China Standard Time)

Please note that UTN11 is not compatible with the ordering of Myanmar text that Uniscribe and the OpenType Myanmar spec recommend (and which HarfBuzz slightly modifies). HarfBuzz should remain compatible with Uniscribe as much as possible (while of course fixing known bugs in it, which we have done several times in the past).

Also, nobody has really ever reviewed UTN11 normalization in detail. I believe I only looked at its Burmese-language parts, and even those were not in complete sync with what Unicode is recommending for Burmese-language text (which is in sync with Uniscribe/OpenType/HarfBuzz).

Martin Jansche · Answer 2 · Thu Jun 15 2017 03:57:43 GMT+0800 (China Standard Time)

Here's the bigger problem: There are too many different non-equivalent representations of what intuitively could be thought of as equivalent Myanmar text. I'll give an extended example, bear with me.

No matter how broadly or narrowly we define possible grapheme clusters for Myanmar script, I think we can agree that this definitely includes a single letter followed by zero or more marks/modifiers. Consider the following sequence, which occurs in real Burmese text: <101C 103E 102F 1036 1037>. It's easy to check that it consists of a single letter (Lo) followed by four nonspacing marks (Mn). As it appears here, this sequence is in proper UTN11 storage ordering.
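The "easy to check" part can be done with Python's standard `unicodedata` module. This is only a small sketch: the names `CLUSTER` and `describe` are made up for illustration, and the result depends on the Unicode data shipped with your Python build. It also prints each character's canonical combining class, which matters for canonical reordering, since that only swaps adjacent marks whose classes are both nonzero.

```python
import unicodedata

# The cluster from the discussion: one letter followed by four marks.
CLUSTER = "\u101C\u103E\u102F\u1036\u1037"

def describe(text):
    """Return (code point, general category, combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))
        for ch in text
    ]

# Print one row per character so the categories can be inspected by eye.
for row in describe(CLUSTER):
    print(row)
```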

There are 4! = 24 permutations of the four modifiers. Among those 24 strings, there is not a single pair of non-identical strings that are canonically equivalent in the usual Unicode sense. Put differently, all 24 permutations are canonically distinct. All 24 strings are in NFC. The UTN11 algorithm however maps all 24 permutations down to exactly one string (the ordering shown above). So while none of them are canonically equivalent, all of them are UTN11-equivalent.
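The claim about canonical equivalence can be checked mechanically. The sketch below (helper names are hypothetical) builds all 24 permutations and counts how many distinct strings remain after standard Unicode normalization; it says nothing about UTN11 itself, it only tests NFC and NFD as implemented by Python's `unicodedata`.

```python
import itertools
import unicodedata

BASE = "\u101C"                      # the letter
MARKS = "\u103E\u102F\u1036\u1037"   # the four nonspacing marks

def permuted_clusters(base, marks):
    """Return the base letter followed by every ordering of the marks."""
    return [base + "".join(p) for p in itertools.permutations(marks)]

def count_distinct(strings, form):
    """Count the distinct strings left after normalizing to the given form."""
    return len({unicodedata.normalize(form, s) for s in strings})

clusters = permuted_clusters(BASE, MARKS)

# True if normalizing to NFC leaves every permutation unchanged.
already_nfc = all(unicodedata.normalize("NFC", s) == s for s in clusters)

print(len(clusters), already_nfc)
print(count_distinct(clusters, "NFC"), count_distinct(clusters, "NFD"))
```

If the two counts equal the number of permutations, no pair of orderings is canonically equivalent.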

All 24 permutations are also intuitively equivalent. Looking at the characters in <101C 103E 102F 1036 1037> there is really only one intuitive way to assign a visual and phonetic meaning to the whole cluster. Intuitively that cluster should be expected to render as a main glyph for the letter 101C, surrounded by glyphs for the nonspacing marks that modify it. 103E can only go below on the left; 102F can go below to its right, or could optionally appear to the right of the main letter; 1036 can only appear above; and 1037 appears below and to the right of 102F. There is no intuitive room for confusion about how this should render and what it means, in all 24 permutations.

However, all 24 permutations render differently in HarfBuzz at head (76c4873). Crucially, only one storage order (namely the UTN11 ordering) results in a correct rendering without any dotted circles.

I'm claiming we're doing users of Myanmar text (as well as developers of input methods etc.) a disservice by being so restrictive about which ordering is deemed acceptable. UTN11 normalization is useful because it collapses non-canonically-equivalent representations that get confused in practice because they are intuitively equivalent. The exact choice of representation doesn't even matter; if the Uniscribe ordering is easier to work with, that's fine. The main point is that strings that only have one visual representation should ideally render the same.
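To make "collapsing" concrete, here is a toy sketch. It is neither the UTN11 algorithm nor anything HarfBuzz does: `MARK_RANK` is a hypothetical table that covers only the four marks from the example, ranked in the order given above, and `reorder_marks` leaves any cluster containing other characters untouched. The point is only that a fixed ranking of marks maps all 24 orderings to one string, whichever ranking is chosen.

```python
import itertools

# Hypothetical ranks for ONLY the four marks in the example; a real
# normalizer needs the full script repertoire and per-language rules.
MARK_RANK = {"\u103E": 0, "\u102F": 1, "\u1036": 2, "\u1037": 3}

def reorder_marks(cluster):
    """Sort the marks after the first character by MARK_RANK.

    Clusters containing any mark outside the toy table are returned as-is.
    """
    base, marks = cluster[:1], cluster[1:]
    if any(m not in MARK_RANK for m in marks):
        return cluster
    return base + "".join(sorted(marks, key=MARK_RANK.__getitem__))

# All orderings of the four marks after the letter U+101C.
variants = ["\u101C" + "".join(p) for p in itertools.permutations(MARK_RANK)]
collapsed = {reorder_marks(v) for v in variants}

print(len(set(variants)), len(collapsed))
```

Swapping the ranks for a Uniscribe-style order would change which string survives, not the fact that only one does.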

jfkthame · Answer 3 · Wed Oct 04 2017 18:11:16 GMT+0800 (China Standard Time)

Quoting Martin Jansche: "I'm claiming we're doing users of Myanmar text (as well as developers of input methods etc.) a disservice by being so restrictive about which ordering is deemed acceptable."

I'm conflicted about this. Being restrictive about which ordering is acceptable (i.e. renders as the user desires) will encourage more consistent encoding of text (although we know there is existing text that is inconsistent with the "approved" representation), and this will result in more useful (consistently searchable, indexable, interoperable) data. Given that the variant representations are not canonically equivalent in Unicode terms, general-purpose software will not recognize that they are "intuitively equivalent". While accepting "variant spellings" and rendering them identically may make text creation convenient for users, it may also make the resulting text less useful.

Another key question is what Microsoft intends to do in this regard. If we add UTN11-based normalization to HarfBuzz, such that any of these "equivalent" (from a user's POV) strings render the same, but only one of them will render correctly in Windows, it's unclear that we're really doing users a service.

Behdad Esfahbod · Answer 4 · Wed Oct 04 2017 18:14:15 GMT+0800 (China Standard Time)

cc @PeterCon

Richard57 · Answer 5 · Sun Oct 08 2017 23:13:21 GMT+0800 (China Standard Time)

While I agree with jfkthame, I would note that there are cases where different, inequivalent encodings that render the same are perfectly legitimate. The different encodings may represent collation differences, e.g. one Welsh letter 'ng' v. two Welsh letters 'ng', though a soft-hyphen as opposed to CGJ is sometimes visible. Different encodings usually represent different pronunciations, e.g. compatibility ideographs for pronunciation variants from KS X 1001:1998, umlaut v. diaeresis for German (the dots differ in some now rare styles), and, dear to my heart, the distinction between the assimilated Tai reflexes of Pali 'tti' (ᨲ᩠ᨲᩥ) and final 'tita' (ᨲᩥ᩠ᨲ) in Tai Tham. Traditional collation may take note of the differences. I've actually made a font (Da Lekh Si) to colour code the Tai Tham differences, and have now got it working in the Firefox spell-checker - I can now see the difference before I choose.