unicode-org / message-format-wg

Developing a standard for localizable message strings

Extra spaces in markup

mihnita opened this issue · comments

Our current grammar for markup is this:

markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] "}"  ; open
       / "{" [s] "#" identifier *(s option) *(s attribute) [s] "/}" ; standalone
       / "{" [s] "/" identifier *(s option) *(s attribute) [s] "}"  ; close

That's what it looks like after adding options to close.
But the issue is independent.

I would like to make the case that the first space after { is not only unnecessary, but somewhat problematic.

  1. It forces us into a (possibly big) lookahead.
    We see the {, and then we might have to "consume" a lot of spaces to know where we are (markup or expression)

    I decided to strike this point through because it is the least compelling argument. The issue is not about parsing at all. [mihnita]
  2. The open / close property is associated with the whole markup, not the identifier.
    It is the whole marker that is standalone or close, not the identifier.
  3. It is somewhat internally inconsistent.
    The / at the end of the standalone is tied with the closing }, there is no [s] in between them.
    This was true before PR #649, but it is more visible now.
  4. It is even invalid in HTML and XML.
    None of this works: < b> and < /b> and </ b>

For convenience I will include the HTML below unprotected, so that we can see it in the issue:

<p>Say something <i>italic</i> and last (works).</p>
<p>Say something <  i>italic</i> and last (fails).</p>
<p>Say something <i  >italic</i> and last (works).</p>
<p>Say something <i>italic<  /i> and last (fails).</p>
<p>Say something <i>italic</  i> and last (fails).</p>
<p>Say something <i>italic</i  > and last (works).</p>

Say something italic and last (works).

Say something < i>italic and last (fails).

Say something italic and last (works).

Say something italic< /i> and last (fails).

Say something italic and last (fails).

Say something italic and last (works).

Proposed change:

markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] "}"  ; open
       / "{" [s] "#" identifier *(s option) *(s attribute) [s] "/}" ; standalone
       / "{" [s] "/" identifier *(s option) *(s attribute) [s] "}"  ; close

to:

markup = "{#" identifier *(s option) *(s attribute) [s] "}"  ; open
       / "{#" identifier *(s option) *(s attribute) [s] "/}" ; standalone
       / "{/" identifier *(s option) *(s attribute) [s] "}"  ; close
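To make the difference concrete, here is a rough sketch that approximates both versions of the rule with regular expressions and checks a few inputs. The `S` and `IDENT` patterns are simplified stand-ins, not the real `s` and `identifier` productions, and options and attributes are omitted; the `accepts` helper is hypothetical.

```python
import re

# Simplified stand-ins for the spec productions (assumptions for this sketch).
S = r"[ \t\r\n\u3000]*"               # optional whitespace, i.e. [s]
IDENT = r"[A-Za-z_][A-Za-z0-9_.-]*"   # rough approximation of `identifier`

# Options and attributes are left out to keep the patterns readable.
CURRENT = re.compile(
    r"\{" + S + r"(?:#" + IDENT + S + r"/?\}"   # open or standalone
    + r"|/" + IDENT + S + r"\})"                # close
)
PROPOSED = re.compile(
    r"\{(?:#" + IDENT + S + r"/?\}"             # open or standalone
    + r"|/" + IDENT + S + r"\})"                # close
)

def accepts(grammar: re.Pattern, text: str) -> bool:
    """Return True if the whole text is one markup placeholder."""
    return grammar.fullmatch(text) is not None

for sample in ["{#b}", "{#b }", "{ #b}", "{ /b }", "{# b}", "{#img /}"]:
    print(f"{sample!r:12} current={accepts(CURRENT, sample)} "
          f"proposed={accepts(PROPOSED, sample)}")
```

Under these assumptions the only inputs that change status are the ones with whitespace between `{` and the `#` or `/`. Whitespace before the closing `}` is accepted by both versions, and whitespace between the sigil and the identifier is rejected by both.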

Note: this is not the same as the expressions.
So there is no need to be consistent, because the similarity is only superficial.
They might look the same, but the meaning is not the same.

There, detecting { tells us we are in an expression, with no need for extra lookahead.
And the "decorations" { |foo| } or { $foo } or { :foo } DO belong to "foo".
It is "foo" that is a literal / operand / function, not the expression itself.

I think this is a non-starter? Our syntax is very consistent about optional whitespace: we are very lax when it's optional, especially inside expressions. The {/} are never part of any construct.

I think the thing that might be confusing here is that the syntax uses identifier directly in the markup production (which I think is a natural thing to do), so you don't have the insulation that function or annotation have, in which the sigil is clearly attached to the identifier:

markup = "{" markup-identifier *(s option) *(s attribute) [s] "}" <- this doesn't work because of standalone
markup-identifier = ( "#" / "/" ) identifier

All of our other identifiers and sigil-introduced tokens are of the sigil-identifier flavor. For consistency, this should be too. There is some lookahead to find the type, but it only consumes whitespace (so it can be optimized).

Our syntax is very consistent about optional whitespace: we are very lax when it's optional

Agree, but I don't think this is optional, and the sigils should not be attached to the identifier.
They mark the start / end of the markup.
In the same class as {{ and }}, where we don't allow spaces between the curly braces.
And with the end of standalone, where we don't allow space between / and }.

Right above your comment I explain why I think this is not at all the same.

the thing that might be confusing here is that when you write the syntax using identifier directly in the markup production

I am not judging this based on the grammar, I am judging it as a user seeing the syntax.

Start, {#foo}some text{/foo} and more.

What I see is {# ..... }, delimiters.

These are spaces that I agree are optional:

Start, {#   foo    }some text{/    foo   } and more.

{# marks the beginning of an open / standalone marker,
{/ the beginning of a close marker,
and /} marks the end of a standalone marker.
They are delimiters, same as {{...}}, or |...|.
One "squints" at {/....} and says: ah, this is a closing marker. The / is not a sigil of the name part.

There is no markup language (that I know of) that allows this kind of spaces.
If there is one, I would like to see it.

I think it's way too late to introduce this syntax change for consideration, and I do not think we should consider this for LDML 45.

As an implementer, I'd like to note that I had no difficulty dealing with the current syntax. In a pattern, from the { you can see that you have some expression or markup starting, and then you need to look past any subsequent whitespace to determine what that is.
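As an illustration of the two lookahead strategies being compared, here is a minimal sketch. It is not taken from any actual implementation: the function names, the returned labels, and the `WHITESPACE` set are all assumptions made for this example.

```python
WHITESPACE = " \t\r\n\u3000"  # assumed stand-in for the `s` production

def _kind(ch: str) -> str:
    """Map the first significant character to a placeholder kind."""
    if ch == "#":
        return "markup-open-or-standalone"
    if ch == "/":
        return "markup-close"
    return "expression"

def classify_current(src: str, brace: int) -> str:
    """Current grammar: skip optional whitespace after '{', then inspect the sigil."""
    i = brace + 1
    while i < len(src) and src[i] in WHITESPACE:
        i += 1
    return _kind(src[i:i + 1])

def classify_proposed(src: str, brace: int) -> str:
    """Proposed grammar: the sigil must directly follow '{'."""
    return _kind(src[brace + 1:brace + 2])

print(classify_current("{   #b}", 0))
print(classify_proposed("{   #b}", 0))
```

In this sketch the proposed rule routes `{   #b}` to the expression path, where a real parser would then have to report a syntax error at the `#`. The current rule costs one whitespace-skipping loop, which matches the observation above that the lookahead only consumes whitespace.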

As an implementer, I'd like to note that I had no difficulty dealing with the current syntax

This is not about implementing. That is point 1 of 4, and in fact the weakest one.
Should not have been the first one (or included at all, probably)

It is that it does not make sense semantically, as a user, and no other system does that.

We agreed to the syntax for 45 in the F2F. I'm going to mark this for consideration during tech preview.

BTW, if we are to be "loose" with the spaces, why not allow spaces between the # and identifiers?

Current:

"{" [s] "#" identifier *(s option) *(s attribute) [s] "}" 

Extra space:

"{" [s] "#" [s] identifier *(s option) *(s attribute) [s] "}" 

The current syntax allows ...{ /bold }... but not ...{/ bold }...
Why make it mandatory to "glue" the / and # on the identifier?

I think this is all based on a superficial (visual) similarity with the placeholders.

I'd also just like to chime in that I agree with @mihnita - restricting the syntax here to ensure that there is no whitespace between { and # and { and /, aligning with the standalone close token (/}) would be great - it is visually much clearer than { # and { /

I think some care should be exercised about how we discuss the whitespace here. If one just looks at the sigil, the spaces look weird. But notice that in the remainder of our syntax, the sigil is attached to something, e.g. .keyword, :function, $variable, |literal|. So one argument would be that the # and / sigils should attach to the markup identifier:

{   #strong   }

The counter argument would seem to be that markup is a fundamentally different type of expression, so it's not really a sigil, it's a different introducing sequence:

{#   strong   }

Looking at what HTML does is not really instructive, since any HTML would be produced from the markup syntax and we need to think about what the needs of other syntaxes are as well.

I'd be curious how markup is currently parsed by implementations? Attaching the sigil to the starter would require a one character lookahead in each expression to check if the expression is markup. Attaching the sigil to the identifier would be more similar to seeking the next token (see list above). That doesn't mean that the lookahead is evil. I'm just curious if the difference in parsing is worth it.

I'd be curious how markup is currently parsed by implementations?

I parse markup together with expressions. Because the constructions are syntactically so similar, it's easier to have just one handler for the stuff between curly braces.

As a user, I would find variance between the whitespace requirements of expressions and markup very confusing. I don't find the /} confusing, because it's the only such terminal syntax we have besides } and }}.

My parser does the lookahead after consuming the '{' -- if the next character is '#' or '/' a separate parseMarkup() function is called, which reuses the options and attributes parsing code but is separate from parseExpression(). It wouldn't cause a problem for me to discard whitespace between '#' and the identifier. I don't have a strong opinion on this, though.