unicode-org / message-format-wg

Developing a standard for localizable message strings

[Discussion] {{Spannables}}

aphillips opened this issue · comments

Per the 2023-11-27 teleconference, this issue is for discussing the design of spannables (also known as open/close/standalone markup).

The design document lives here and should be used as a reference in this discussion.

(chair hat on)

There is a proposal to use option A4 "Hash and Slash" as the design based on a tepid "lack of opposition" consensus in the 2023-11-20 call, so perhaps pay close attention to this option, even though the design document uses the +/-/# sigils in examples. We merged #535 which makes "Hash and Slash" the design option per 2023-11-27 call, but the options available for discussion remain the same.

The foregoing is not a finding of consensus. Our goal will be to choose an approach in the 2023-12-04 call. The more and better we discuss ahead of that, the better.

I'm ok with the currently proposed syntax.

My personal preference order for these choices is:

  1. <foo>…</foo> — In practice, this is the syntax currently used by markup in messages, and it's almost universally recognised as such. And while its origins are in XML, it's currently widely used at least as XML, HTML, and JSX, with slightly different parsing rules in each. I honestly think this would be the least surprising choice for users, and this would explicitly require those who wish to re-parse their output rather than formatting it to parts to do some heavier lifting rather than being able to default to this syntax, should we choose something else.
  2. {+foo}…{-foo} — If we need to go with curly braces, this is pretty good. It's what's in the spec right now, and the +/- pairing is pretty self-evident. Yes, it means that the rules need some fiddling to still allow for the really rare literal negative number operand, but I don't see that as a real cost to humans: -foo and -42 "read" differently, with the first one seeming to have two separate tokens - and foo, while -42 is a single number.
  3. {#foo}…{/foo} — It's fine, and works. It replaces the - problem by reusing what's control flow syntax elsewhere, but maybe that's not such a bad thing? It does mean that others who also wanted to span a pattern with some open/close indicators ended up with this syntax.
  4. {foo}…{/foo} — Unlike with all the other options, there's nothing about a bare {foo} that primes you as the reader to expect it to 1) not render in text and 2) start something. Especially if we still allow numbers so {42} would parse and render as a placeholder. To make this option work, we need to get rid of unquoted literals, or at least the non-numeric ones. And if we do that, then I'd prefer that we use some sigil at the start, and that brings us to one of the two preceding options.
  5. [foo]…[/foo] — If we're going to go beyond curly braces, let's just go with angle brackets. Outside of their use as markup, angle brackets are way less common in real-world text than square brackets (not going to dig up numbers, because it seems no-one else cares about data) and more surprising to need special treatment in syntax.

I don't have a strong stance on standalone markup, except to note that it's much less common in practice than open-close pairs, and that its use cases can be accounted for by either a purpose-built {:function} or an opening element like <foo>. With the {+foo} syntax in particular it does feel a bit clumsy as the + kinda expects the subsequent - to balance out (absolutely a feature, btw), so if we go with that I'd be more open than with the others to considering separate standalone syntax; the design doc includes {#foo} for that alternative.

As far as I can tell, the only place where using the same syntax for open & standalone adds some friction is for source message validators that do not access a registry and which do want to require open-close pairing within each single message. In all other cases, we can rely on the registry, the source message, or the implementation to tell us whether the element is open or standalone.

So for me the cost-benefit analysis of standalone markup makes it a pretty expensive addition providing rather little gain.
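
To make the validator friction concrete, here is a minimal sketch of a registry-free pairing check for the "hash and slash" option. The tokenizer regex and the function name `unpaired_markup` are hypothetical and much simpler than the real MF2 grammar; options inside the braces are ignored.

```python
import re

# Hypothetical, simplified tokenizer: {#name ...} opens, {/name ...} closes.
# This is not the MF2 grammar, just enough to illustrate the pairing check.
MARKUP = re.compile(r"\{([#/])([A-Za-z][\w:-]*)[^{}]*\}")


def unpaired_markup(message: str) -> list[str]:
    """Return the names of markup placeholders that are not properly paired."""
    stack: list[str] = []
    problems: list[str] = []
    for sigil, name in MARKUP.findall(message):
        if sigil == "#":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            # A close with no matching open on top of the stack.
            problems.append(name)
    # Anything still on the stack was opened but never closed.
    return problems + stack


print(unpaired_markup("This has {#b}bold{/b} text"))
print(unpaired_markup("An {#img} here"))
```

Without a registry, the second message is reported as having an unclosed `img`, because nothing in the syntax says whether `{#img}` was meant as open or standalone.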

@eemeli mentions:

<foo>…</foo> — In practice, this is the syntax currently used by markup in messages, and it's almost universally recognised as such.

Note that this "works" in our current syntax without any changes, since nothing prevents the literal part of a pattern from containing markup. However, the markup doesn't participate in formatting in any way. Making it participate in formatting would require recognizing the sigils < and > and adding more escapes to our syntax to account for them.

I generally agree with your other comments.

As far as I can tell, the only place where using the same syntax for open & standalone adds some friction is for source message validators that do not access a registry...

I think it would be useful to add an example, such as "If a tool preparing a message for translation adds XLIFF around placeholders, it might need to know if the placeholder is paired or not, as this affects which tags are generated, even if the tool doesn't know the tag set being marked up."

Some observations.

In the design doc, we currently have name for markup productions and I think this should be changed to identifier to ensure that namespaces are permitted:

markup-open  = "#" name ; should be identifier
markup-close = "/" name   ; should be identifier
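
As a rough illustration of the difference, the sketch below matches markup placeholders with and without an optional namespace prefix. The `NAME` pattern is an ASCII-only stand-in for the real `name` production, and `parse_markup` is a hypothetical helper, not anything from the spec.

```python
import re

# Simplified ASCII-only stand-in for the `name` production (assumption).
NAME = r"[A-Za-z_][A-Za-z0-9_.-]*"
# markup with a bare name: {#name} or {/name}
NAME_ONLY = re.compile(rf"\{{([#/])({NAME})\}}")
# markup with an identifier: optional "namespace:" prefix before the name
IDENTIFIER = re.compile(rf"\{{([#/])(?:({NAME}):)?({NAME})\}}")


def parse_markup(placeholder: str, allow_namespace: bool = True):
    """Return (kind, namespace, name), or None if the placeholder is rejected."""
    pattern = IDENTIFIER if allow_namespace else NAME_ONLY
    match = pattern.fullmatch(placeholder)
    if match is None:
        return None
    if allow_namespace:
        sigil, namespace, name = match.groups()
    else:
        (sigil, name), namespace = match.groups(), None
    kind = "open" if sigil == "#" else "close"
    return kind, namespace, name


print(parse_markup("{#html:strong}"))
print(parse_markup("{#html:strong}", allow_namespace=False))
```

With `name` only, `{#html:strong}` is rejected; with `identifier`, it parses with `html` as the namespace.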

The primary "disagreement" we have is about the fate of standalone. Hash-and-slash allows open (or close) placeholders to appear unpaired and @eemeli proposes that we just let # be standalone. Assuming that we buy into the cases for separating standalone syntax, the cost of adding standalone is simply one more sigil. Could we agree to choose one more sigil for "standalone" and make it identical to markup-open save for the "standalone" connotation in the data model?

The proposed “hash and slash” solution is acceptable to me precisely due to making room for standalone syntax which doesn’t need a third sigil. So it’s not “just one more sigil” for me; it’s still three.

#542 proposed three solutions to how we can support standalone markup without adding another sigil.

(thinking out loud)

Looking at the use cases in the spannables design this morning with an eye towards the discussion about requirements for the selected design, I see a class of cases where what translators want is:

  • code-like elements to be protected during the translation process--visible and moveable, but not something the translator has to retype
    • when the items are paired and ordered, they should stay in the correct order and enforce open/close
    • when there is something inside the element that needs translation, it should be exposed

That is, translators want tools to produce XLIFF's placeholders for them. We could code that in our syntax, I suppose:

This has {#bpt}<strong>{/bpt}bold{#ept}</strong>{/ept} needs and
       {#ph}<img alt="{#sub}Translate me!{/sub}" href=$url>{/ph}.

This has the benefit that it allows unpaired open or close code while allowing validation that the translation tooling markup is paired and syntactically correct. Formatting to parts can produce single-pass non-reparsed results.

This is different from what developers want, since it is a PITA to type and difficult to look at--and adds no value to developers (except the deferred benefit of non-broken translations). CAT tools have to process messages anyway and would be better at inserting and removing (and maintaining) this protection than developers.

Some developers won't mind learning a message-specific variation on their code syntax and will want direct participation in rendering (that is, single-step format-and-process). This is mostly what we've been talking about as spannables. The above example could then look like this (using @stasm's #/ markup for standalone):

This has {#html:strong}bold{/html:strong} needs and {#html:img alt=|Translate me?| href=$url /}.

This doesn't quite satisfy what translators want, since it loses a number of checks they'd like to have (and which they get from raw XLIFF processing of HTML or other markup languages). Specifically, the open and close can get out of order without producing an error. To that end, we might want to introduce non-option expression attributes to help tooling, e.g.:

This has {#html:strong @id=s1}bold{/html:strong @id=s1} and ...

I spy with my little eye another concern that we've somewhat implicitly chosen not to address in the 2.0 release: sub-flows, to use the XLIFF term.

In essence, in a message

This has {#html:strong}bold{/html:strong} needs and {#html:img alt=|Translate me?| href=$url /}.

the Translate me? part could (should?) be considered a separate translation unit rather than a literal value that can't contain a variable reference. Are we really okay with this? Or should we leave space for later reconsideration that would allow for something like a .local taking a pattern value?


As for the message in question, my expectation would be that in the real world it ends up either as

This has <b>bold</b> needs and <img alt="Translate me!" href="$url">.

or as

This has {#b}bold{/b} needs and {#img alt=|Translate me?|/}.

In the first case, the developer is formatting to a string and just presumes that HTML will be fine, and that translators will know how to deal with localizable attributes. XSS is a concern that's dealt with Elsewhere™.

In the second case, the developer is formatting to parts and separately merging in the href, and therefore needs to play according to our rules. Their localization uses tools that also need to be MF2-aware.
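
For illustration only, formatting the second form to parts could look roughly like the sketch below. The part shape (`type`, `kind`, `name`), the regex, and the `format_to_parts` name are assumptions made for this example, not the spec's data model; options such as `alt` are matched but not parsed.

```python
import re

# Hypothetical, simplified markup matcher: {#name ...} opens, {/name} closes,
# and {#name .../} is standalone. Options are captured but not interpreted.
MARKUP = re.compile(r"\{([#/])(\w+)([^{}]*?)(/?)\}")


def format_to_parts(message: str) -> list[dict]:
    """Split a message into text parts and markup parts, in order."""
    parts: list[dict] = []
    pos = 0
    for match in MARKUP.finditer(message):
        if match.start() > pos:
            parts.append({"type": "text", "value": message[pos:match.start()]})
        sigil, name, _options, slash = match.groups()
        if sigil == "/":
            kind = "close"
        elif slash:
            kind = "standalone"
        else:
            kind = "open"
        parts.append({"type": "markup", "kind": kind, "name": name})
        pos = match.end()
    if pos < len(message):
        parts.append({"type": "text", "value": message[pos:]})
    return parts


for part in format_to_parts(
    "This has {#b}bold{/b} needs and {#img alt=|Translate me?|/}."
):
    print(part)
```

The caller can then walk the parts and attach its own attributes, such as the href, to the `img` part without re-parsing a formatted string.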

In neither case do I believe that strings which may include HTML will use an html: namespace.

With the latter case, the "MF2-awareness" of the tools may well be encoded in an MF2-XLIFF transformer, so the translator's view of this string could be something completely different. And for localizable attributes, it might even be able to extract the sub-flow from the parent message.

the Translate me? part could (should?) be considered a separate translation unit rather than a literal value that can't contain a variable reference. Are we really okay with this? Or should we leave space for later reconsideration that would allow for something like a .local taking a pattern value?

One could solve that using a .local, but we don't provide something at the moment.

I am thinking that we should keep our eyes on the XLIFF transform. Curious what you think about using attributes here.

In neither case do I believe that strings which may include HTML will use an html: namespace.

I'm also curious why you think so? A namespace would make visible the type of markup to tooling as well as to the formatter runtime. I know you're mostly thinking about the case in which a data model or "format-to" part is handled by the formatter's caller (rather than as part of formatting), but even there I can see how users will want to plug-in and differentiate different markup regimes. Having a namespace prefix tells me if {#span} is HTML or TTML or something else and provides a hook from which to dangle the implementation code.

If we are comfortable requiring that a single namespace be used for the spannables, that could be in the 'preface' section:

In pseudocode:

.namespace=html5

or

-namespace=html5 scope=spannables

Also, I don't think we need the id=x. The only case where that would be necessary is with 2 identically named items. But even there, I don't think the tooling needs anything. The IDs can be purely internal, derived from the original message:

x{#b stuff1}y{/b stuff2}z{#b stuff3}w{/b stuff4}
=> x{#b stuff1 id=1a}y{/b stuff2 id=1b}z{#b stuff3 id=2a}w{/b stuff4 id=2b}

The tooling would require that the a/b pairs be in order in the translation, but the id numbers can occur in any order.
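
A minimal sketch of deriving such internal IDs from the source message, assuming well-formed nesting. The regex and the `add_pair_ids` helper are hypothetical, and the `id=` output format simply mirrors the example above.

```python
import re

# Hypothetical, simplified matcher for {#name ...} and {/name ...}.
MARKUP = re.compile(r"\{([#/])(\w+)([^{}]*)\}")


def add_pair_ids(message: str) -> str:
    """Annotate each open/close pair with a derived id: Na for open, Nb for close."""
    open_ids: list[tuple[str, int]] = []
    counter = 0

    def annotate(match: re.Match) -> str:
        nonlocal counter
        sigil, name, rest = match.groups()
        if sigil == "#":
            counter += 1
            open_ids.append((name, counter))
            pair_id = f"{counter}a"
        else:
            # Assumes well-formed nesting: the close matches the latest open.
            _, number = open_ids.pop()
            pair_id = f"{number}b"
        return f"{{{sigil}{name}{rest} id={pair_id}}}"

    return MARKUP.sub(annotate, message)


print(add_pair_ids("x{#b stuff1}y{/b stuff2}z{#b stuff3}w{/b stuff4}"))
```

Since the IDs are computed from the position of each open in the source, nothing needs to be written into the message by the developer.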

the Translate me? part could (should?) be considered a separate translation unit rather than a literal value that can't contain a variable reference. Are we really okay with this? Or should we leave space for later reconsideration that would allow for something like a .local taking a pattern value?

One could solve that using a .local, but we don't provide something at the moment.

Are you thinking of some custom sub-syntax-formatter function? Like this:

.local $x = {|Translate $foo here.| :template foo=$foo}
... {#img alt=$x/} ...

With the way we're now going, that'll be a pretty likely outcome.

I am thinking that we should keep our eyes on the XLIFF transform. Curious what you think about using attributes here.

Can you clarify which attributes you're thinking of here?

In neither case do I believe that strings which may include HTML will use an html: namespace.

I'm also curious why you think so?

Because in most cases it's not necessary as systems which may include HTML in their messages will only use HTML for markup. And when something else is needed as well, then namespaces like ttml: will make that easy.

Developers are lazy, and they'll go with {#b} rather than {#html:strong} because the former will work just as well as the latter. They'll know and control how in code the message is used, and how the formatter for the message is called. In practice, it's for the exact same reason why most current localizable messages that include HTML or XML <tags> don't namespace them.

@macchiati

Also, I don't think we need the id=x.

The point of id (and other attributes) would be compatibility with XLIFF, not anything internal to MF2. The id attributes are how XLIFF keeps track of where elements are paired. Other attributes track whether tags can be reordered or removed, etc.

That is, I'm thinking about the problem "how do we enable CAT tools to generate the XLIFF markup the developer intends?" while simultaneously letting developers put markup into messages.

@eemeli

Are you thinking of some custom sub-syntax-formatter function? Like this:

Maybe even less specific than that:

.local $x = {|Translate $foo here| @translate=yes}  // :string implied
{{You have some {#img alt=$x /} in this pattern}}

Developers are lazy, and they'll go with {#b} rather than {#html:strong} because the former will work just as well as the latter.

Yes, that's true. But we should keep an eye out to enabling (not requiring) ways to do more complex things. I've been including namespacing in examples not because I don't think folks will use {#b} when being lazy, but instead thinking about non-lazy cases where namespacing becomes useful. Your comment was close to saying that folks would never use namespaces, which is different from "mostly won't bother with"

I also remain concerned about "two syntaxes in the same message"--I have multiple examples of places where this has bothered me in the past.

Are you thinking of some custom sub-syntax-formatter function? Like this:

Maybe even less specific than that:

.local $x = {|Translate $foo here| @translate=yes}  // :string implied
{{You have some {#img alt=$x /} in this pattern}}

That won't work, because the implicit (custom) :string won't have access to the value of $foo unless it's explicitly passed in as an option.

Your comment was close to saying that folks would never use namespaces, which is different from "mostly won't bother with"

The latter is what I intended to communicate.

I also remain concerned about "two syntaxes in the same message"--I have multiple examples of places where this has bothered me in the past.

Indeed. Which is why I started to wonder whether we should effectively reserve enough space in the syntax for a .local to take a pattern rather than expression value.

In HTML, the lack of syntactic distinction between "open" and "standalone" causes problems and hardcoded lists of elements that can be one or the other. Let's not start a new standard with these problems and hacks.

I don't feel strongly about the particular syntax, whether {#standalone} or maybe even {+-standalone} to save another "sigil". I just feel fairly strongly that we need a syntactic distinction.


Do I understand correctly that "markup" is not going to be in the registry? That makes me nervous. It seems like different organizations will invent different sets of things and how to process them, making messages with markup not-interoperable.

This is the discussion thread for spannables. Keeping it open in spite of merging the design doc.