unicode-org / message-format-wg

Developing a standard for localizable message strings

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Complex message start syntax is too strict

eemeli opened this issue · comments

While working on a presentation about the JS implementation, I encountered the following error when testing some of the code I'd written for the slides:

const msg = `
  .local $currency = {EUR}
  .input {$price :number style=currency currency=$currency}
  {{Total price: {$price}}}`;
const mf = new MessageFormat(msg, 'en');
// Syntax parse error: parse-error at [the first { of '{{Total...']

After spending quite a while debugging the implementation code to determine the error here, I came to the conclusion that my parser was correct, the above is indeed a syntax error.

The problem with this message is that the leading whitespace before the first . means that it isn't special, and all of the declaration syntax parses as a valid simple pattern, up until the {{ pattern start.

I mentioned this problem on our Slack channel, where @stasm was also unable to initially see the problem. I think if we can't see it, other developers won't either, and that this will become a significant stumbling block.

We should consider amending the ABNF to allow whitespace before the . at the beginning of a complex message.

If we do something to fix this, we'll need to confront the problem of pattern initial whitespace. The simplest fix would be to change the ABNF to allow non-significant whitespace at the start of the message. Here's the current ABNF:

message = simple-message / complex-message
simple-start-char = content-char / s / "@" / "|"

And here's the modifications:

message = [s] (simple-message / complex-message)
simple-start-char = content-char / "@" / "|"

The primary side-effect of this change is that, if you want spaces at the start of a simple messages, you must quote the pattern.

A different way to address this would be to allow optional whitespace at the start of complex-message. The challenge here is that the message parser has to look ahead until it sees a non-space character. If that character is . or if it is { followed by { (i.e. {{), the message is complex and the whitespace is discarded. Otherwise the whitespace is retained.

complex-message = [s] *(declaration [s]) complex-body

I think @eemeli's experience is not invalid, since we play loosely with whitespace elsewhere. I might have a slight preference for the second (lookahead) solution. Let's discuss in the 2024-01-22 call.

I remember raising this concern multiple times, including a flavor of it in the summary it at #512 (comment) — relevant excerpt from linked comment:

I feel strongly that iterative updates to a message which change its classification (e.g., adding or removing declarations) should not affect treatment of leading or trailing whitespace—cf. #474 (comment)

 logOutMessage = ```
-{%input username}
-
-Log out {$username}?
+Log out?
 ```;

And as far as I can tell, the concern was ultimately rejected, albeit without the requested explicit acknowledgment. Is it to be reconsidered?

I also think the current .foo is a problem.

Looking at the spec:

message = simple-message / complex-message
simple-message = [simple-start pattern]
simple-start = simple-start-char / text-escape / placeholder
simple-start-char = content-char / s / "@" / "|"
text-escape     = backslash ( backslash / "{" / "}" )
content-char      = %x00-08        ; omit HTAB (%x09) and LF (%x0A)
                  / %x0B-0C        ; omit CR (%x0D)
                  / %x0E-19        ; omit SP (%x20)
                  / %x21-2D        ; omit . (%x2E)
                  / %x2F-3F        ; omit @ (%x40)
                  / %x41-5B        ; omit \ (%x5C)
                  / %x5D-7A        ; omit { | } (%x7B-7D)
                  / %x7E-D7FF      ; omit surrogates
                  / %xE000-10FFFF

How can I say ".input is in important keyword in MF2", and render the .input?

The current solution is to wrap the whole string in {{...}}
And that makes it a complex message.

At this point we can all say "I told you so!" :-)
But we will all use this problem as an argument for our own old proposal:
tinker with trimming spaces, start in code mode, special marker to enter code mode, etc.

But trying to look at this with fresh eyes, I think the way we enter "complex mode" is too ... complex :-)
Now a complex message starts with .foo OR {{. If not, it is a simple message.

I had a hard time to figure out from the ebnf if a message is complex or not.
Is ".foo bar" complex?
The ebnf fragment quoted above is not enough to determine that.
I also need to look at the definition of placeholder.
And after going through all of that, I decide it is not simple.

Compare for example with some initial marker for complex messages:

message = "{!}" complex-message / simple-message

I can almost instantly see if a message is complex or not.
Same as seeing #! at the beginning of a file I know it is some kind of script, and what kind of script.

Note: I'm not saying that's the way to go (initial marker).
Maybe it is, maybe not.

What I'm saying is that there are more options than trim leading spaces and quote.
"optional whitespace at the start of complex-message" only makes the detection of complex messages harder.
It is already hard.

We should consider amending the ABNF to allow whitespace before the . at the beginning of a complex message.

I agree it's a problem, I don't like jumping to this solution as the only one.

In the 2024-01-22 call we agreed to place this out of scope for LDML45 and to track this issue during tech preview. I'm using the label out-of-scope? for now to represent that choice.

Dropping the out-of-scope? label as that was about the LDML54 release.