unicode-org / message-format-wg

Developing a standard for localizable message strings

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

[FEEDBACK] Improving bidi layout

macchiati opened this issue · comments

In MF1.0, we allow RLM and LRM in any position between syntax tokens that is not part of the literal message; those characters are then ignorable in parsing the message. The advantage of this is that many people will be viewing the message in plaintext editors, and the default results for the bidi algorithm is designed for normal text, and can really mess up "code-like" text (as in MF1.0 or MF2.0).

Here is an example of what happens with BIDI text. This is using a tool (mentioned below) that has the convention that UPPERCASE stands for bidi characters so that people can read them, and I used ⁅...⁆ is used in place of {...} because of the tooling.

Example Source:
* ⁅⁅YOU HAVE ⁅$count⁆ NOTIFICATION.⁆⁆

Resulting Display:
⁅⁅.NOITACIFITON ⁅count :number$⁆ EVAH UOY⁆⁆ *

Notice that "YOU ARE" is reversed & NOTIFICATION is reversed (which is the correct order among the pieces), but the $ is jumbled.

but if the line started with 'one' instead of *, you'd get different results.

one ⁅⁅EVAH UOY ⁅$count :number⁆ NOITACIFITON.⁆⁆

If the syntax allows the insertion of an LRM in locations that are not part of the literal message and ignores those LRM, the source can be consistently left-to-right (so that the syntax is in the right order).

* ⁅⁅YOU HAVE ⁅$count :number⁆ NOTIFICATION.⁆⁆

Resulting Display:

* ⁅⁅EVAH UOY ⁅$count :number⁆ NOITACIFITON.⁆⁆

That is still not good for BIDI languages: the ideal would be that the flow of the syntax was LTR, but the flow of the variant message would be consistently RTL, something like:

* ⁅⁅YOU HAVE ⁅$count :number⁆ NOTIFICATION.⁆⁆

Resulting Display:
⁅⁅.NOITACIFITON ⁅$count :number⁆ EVAH UOY⁆⁆ *

A good "MF2" editor or localization tool would use something more sophisticated (we should call that to people's attention in the spec, pointing to https://www.unicode.org/reports/tr55, especially https://www.unicode.org/reports/tr55/#Ordering for applying HL4).

But a lot of people will be using plain-text editors, and allowing the entry of LRM and RLM can make the syntax more readable. (And tooling can insert those characters in the appropriate places automatically, so that plain text looks better.)

The tool is:
https://util.unicode.org/UnicodeJsps/bidi.jsp?a=one+%E2%81%85%E2%81%85YOU+HAVE+%E2%81%85%24count%E2%81%86+NOTIFICATION.%E2%81%86%E2%81%86&p=Auto&hack=on

Notes:

  • What it does not show is that the brackets ⁅...⁆ are mirrored if displayed R to L, so I did that manually above.
  • If you insert either x or X in the Source (and ignore them in the Reordered output) you can see what the ordering result would be.

LRM and RLM are valid content-char, so if they show up within a pattern the syntax is completely fine with that. But is this about asking for them to be allowed (and ignored?) outside patterns as well, e.g. among variant keys?

What I think Mark is trying to ask for is that optional whitespace inside expressions be permitted to contain bidi controls (tools should prefer isolating paired controls for this--specifically LRI/PDI).

Pattern exteriors are not a problem because the {{ and }} do mirror, but constructs outside of patterns will have mixed direction display issues. The overall syntax is not very bidi friendly, so this message looks badly formed when shown in an LTR context, but is, in fact, syntactically correct:

image

The same message presented RTL is surprisingly "backwards correct" (but only because I didn't use any LTR options/values or names):

image

We can't just add bidi controls to the whitespace production s because the controls are invisible and would interfere with visual parsing. They need to be optional.

Current:

s = 1*( SP / HTAB / CR / LF / %x3000 )

Modified:

s    = *1(%x200E / %x200F / %x2066 / %x2067 / %x2068 / %x2069) 
       1*( SP / HTAB / CR / LF / %x3000 ) 
       *1(%x200E / %x200F / %x2066 / %x2067 / %x2068 / %x2069)

The other alternative would be to fix statements and expressions. This requires making the isolate controls optional on the brackets in pairs. Something modeled like the following (note that expression here compresses three productions in the real ABNF):

expression = "{" [s] (literal / variable) [s annotation] *(s attribute) [s] "}"
           / "{" [s] (s annotation) *(s attribute) [s] "}"
           / "{%x2066" [s] (literal / variable) [s annotation] *(s attribute) [s] "%x2069}"
           / "{%x2066" [s] (s annotation) *(s attribute) [s] "%x2069}"

(The above does not show how statements need modification.)

@eemeli asked:

LRM and RLM are valid content-char, so if they show up within a pattern the syntax is completely fine with that. But is this about asking for them to be allowed (and ignored?) outside patterns as well, e.g. among variant keys?

Yes, that's what he's asking for. Note that these characters are not valid name chars. This also makes them illegal in unquoted literals.

I'll add that one does not want to insert these characters into the text part of a pattern for purpose of making the expressions look good, because they would be part of the output and interfere with the actual message display in bad ways. That's why I put the isolating controls inside the brackets above.

Yes, that's the ask. I marked this as the Preview-Feedback because I don't think this is urgent enough to require addressing in v45. (BTW, @markusicu was the one who caught this.) We might put a note in to the effect that bidi issues will be reviewed during the tech preview.

An expansion of %x200E / %x200F / %x2066 / %x2067 / %x2068 / %x2069 with names is below. (Handy for people who haven't got the codes memorized.)

Code point Name
U+200E LEFT-TO-RIGHT MARK
U+200F RIGHT-TO-LEFT MARK
U+2066 LEFT-TO-RIGHT ISOLATE
U+2067 RIGHT-TO-LEFT ISOLATE
U+2068 FIRST STRONG ISOLATE
U+2069 POP DIRECTIONAL ISOLATE

I think the approach of building into the grammar just enough to produce the right effects is very promising; that has the promise of detecting and preventing mistakes. There are some complications that will need enough time to explore, so we definitely don't want to rush this.

Note: The error messages for malformed bidi controls needs to be clear, because the characters are normally invisible. A simple approach like "error X on line 10 index 15" (aside from being extremely unfriendly in the first place) is especially not recommended for any line that contains RTL characters or isolates.

@macchiati called my attention to this; I have skimmed the discussion more quickly than is advisable, and I am too unfamiliar with the context and too busy with a host of other things to be of very much use here, but I should like to point people to the bits of standards I edit that at a glance seemed relevant to me.

Mark wrote:

allowing the entry of LRM and RLM can make the syntax more readable.

See https://www.unicode.org/reports/tr31/#Whitespace. Addison also wrote about isolates, which may make sense here, since this seems like a case of https://www.unicode.org/reports/tr55/#Interspersed-Syntax, for which the marks are not enough; these are default ignorable, so adding them to the set of whitespace characters as part of a profile with no other qualifications would put them in the set of ignorable format controls.

In this context, I would tend to think having the isolates as ignorable format controls is sensible.

(And tooling can insert those characters in the appropriate places automatically, so that plain text looks better.)

On this see https://www.unicode.org/reports/tr55/#Conversion-To-Plain-Text. If you want this to insert isolates it would have to be beefed up a little.

I would recommend, whatever you do, to include a conformance statement for https://www.unicode.org/reports/tr31/#R3a, so that there well-understood framework for reviewers and implementers to work with.

@eggrobin Thanks for this.

I would recommend, whatever you do, to include a conformance statement for https://www.unicode.org/reports/tr31/#R3a, so that there well-understood framework for reviewers and implementers to work with.

I'm not sure how R3 helps? The problem here is that our syntax (which can be single line) uses neutral characters, including some mirroring characters, and allows non-ASCII (including both weak and strong characters in either direction) in identifiers and values.

If you want this to insert isolates it would have to be beefed up a little.

Actually, beefed up a lot. The pointer to "Conversion to Plain Text" is helpful.

In this release, I think we might add LRM as an optional permitted character before the whitespace production in the syntax. This would allow tools to implement this. We could provide a pointer to this algorithm for users as well.

Current:

; Whitespace
s = 1*( SP / HTAB / CR / LF / %x3000 )

Proposed:

; Whitespace, optionally prepended with LRM
;    see: https://www.unicode.org/reports/tr55/#Conversion-To-Plain-Text
s = [%x200E] 1*( SP / HTAB / CR / LF / %x3000 )

I'm not sure how R3 helps?

Because it allows LRMs optionally by default, next to whitespace and in similar places; but more broadly because it provides a standardized terminology and baseline for definitions of formal syntaxes to describe what kind of invisible characters they allow optionally next to whitespace (and in related places).

In this release, I think we might add LRM as an optional permitted character before the whitespace production in the syntax. This would allow tools to implement this.

Again, this is starting to sound a lot like item 2 of UAX31-R3a; see also https://www.unicode.org/reports/tr31/#Contexts_for_Ignorable_Format_Controls which it references.
The point of introducing the concept of ignorable format controls was to make it possible to implement that algorithm or ones in a similar vein (the recent changes to UAX31, which include the split of R3 into R3a and R3b and the expansion of Section 4, were made concurrently with the development of UTS55, by the same people, including yours truly; these two documents are designed to work with each other).

I think the goal of allowing bidi marks to make it easier to understand the structure of the data is a worthwhile goal.

But we should also make sure that this doesn't open the door to Trojan source attacks. While MF2 maybe has less attack surface for Trojan source attacks than the average programming language, I don't think it's zero.

I think we should add a statement about R3a conformance. We are specifically an R3a-2 implementation, since we don't allow all of the whitespace characters in R3a-1. We would need to add LRM and maybe RLM using the syntax I proposed above. This isn't a question so much of Trojan source as it is one of letting tools make RTL messages legible to humans without interfering with the output.

@duerst I agree that the attack surface is non-zero.

Although the PR for this exists, the WG decided not to include this in LDML45 because it is late-breaking. See meeting notes here