Extra spaces in markup

Question

Extra spaces in markup

mihnita opened this issue 5 months ago · comments

Our current grammar for markup is this:

markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] "}"  ; open
       / "{" [s] "#" identifier *(s option) *(s attribute) [s] "/}" ; standalone
       / "{" [s] "/" identifier *(s option) *(s attribute) [s] "}"  ; close

That's what it looks like after adding adding options to close.
But the issue is independent.

I would like to make the case that the first space after { is not only unnecessary, but somewhat problematic.

It forces us into a (possibly big) lookahead.
We see the {, and then we might have to "consume" a lot of spaces to know where we are (markup or expression)
I decided to strikethrough because it is the least compelling argument. It is not about the parsing, at all. [mihnita]
The open / close attribute is something associated to the whole markup, not the identifier.
It is this whole marker that is standalone or close, not the attribute
It is somewhat internally inconsistent.
The / at the end of the standalone is tied with the closing }, there is no [s] in between them.
This was true before PR #649, but it is more visible now.
It is even invalid in HTML and XML.
None of this works:  and  and

Mihai Nita · Answer 1 · Thu Feb 15 2024 11:12:07 GMT+0800 (China Standard Time)

For convenience I will include the HTML below unprotected, so that we can see it in the issue:

<p>Say something <i>italic</i> and last (works).</p>
<p>Say something <  i>italic</i> and last (fails).</p>
<p>Say something <i  >italic</i> and last (works).</p>
<p>Say something <i>italic<  /i> and last (fails).</p>
<p>Say something <i>italic</  i> and last (fails).</p>
<p>Say something <i>italic</i  > and last (fails).</p>

Say something italic and last (works).

Say something italic and last (fails).

Say something italic and last (works).

Say something italic and last (fails).

Say something italic and last (fails).

Say something italic and last (fails).

Mihai Nita · Answer 2 · Thu Feb 15 2024 11:16:19 GMT+0800 (China Standard Time)

Proposed change:

markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] "}"  ; open
       / "{" [s] "#" identifier *(s option) *(s attribute) [s] "/}" ; standalone
       / "{" [s] "/" identifier *(s option) *(s attribute) [s] "}"  ; close

to:

markup = "{#" identifier *(s option) *(s attribute) [s] "}"  ; open
       / "{#" identifier *(s option) *(s attribute) [s] "/}" ; standalone
       / "{/" identifier *(s option) *(s attribute) [s] "}"  ; close

Mihai Nita · Answer 3 · Thu Feb 15 2024 11:21:01 GMT+0800 (China Standard Time)

Note: this is not the same as the expressions.
So there is no need to be consistent, because the similarity is only superficial.
They might look the same, but the meaning is not the same.

There detecting { tells us we are in an expression, no need for extra lookahead.
And the "decorations" { |foo| } or { $foo } or { :foo } DO belong to "foo".
It is "foo" that is a literal / operand / function, not the expression itself.

Addison Phillips · Answer 4 · Thu Feb 15 2024 12:02:30 GMT+0800 (China Standard Time)

I think this is a non-starter? Our syntax is very consistent about optional whitespace: we are very lax when it's optional, especially inside expressions. The {/} are never part of any construct.

I think the thing that might be confusing here is that when you write the syntax using identifier directly in the markup production (which I think is a natural thing to do), so you don't have the insulation that function or annotation do, in which the sigil is clearly attached to the identifier:

markup = "{" markup-identifier *(s option) *(s attribute) [s] "}" <- this doesn't work because of standalone
markup-identifier = ( "#" / "/" ) identifier

All of our other identifiers and sigil-introduced tokens are of the sigil-identifier flavor. For consistency, this should be to. There is some lookahead to find type, but it's consuming whitespace (so can be optimized).

Mihai Nita · Answer 5 · Thu Feb 15 2024 13:37:20 GMT+0800 (China Standard Time)

Our syntax is very consistent about optional whitespace: we are very lax when it's optional

Agree, but I don't think this is optional, and the sigils should not be attached to the identifier.
They mark the start / end of the markup.
In the same class as {{ and }}, where we don't allow spaces between the curly braces.
And with the end of standalone, where we don't allow space between / and }.

Right above your comment I explain why I think this is not at all the same.

the thing that might be confusing here is that when you write the syntax using identifier directly in the markup production

I am not judging this based on the grammar, I am judging it as a user seeing the syntax.

Start, {#foo}some text{/foo} and more.

What I see is {# ..... }, delimiters.

These are spaces that I agree are optional:

Start, {#   foo    }some text{/    foo   } and more.

{# marks the beginning of an open / standalone marker,
{/ the beginning of a close marker,
and /} marks the end of a standalone marker.
They are delimiters, same as {{...}}, or |...|.
One "squints" at {/....} and says: ah, this is a closing marker. The / is not a sigil of the name part.

There is not markup (that I know of) that allows these kind of spaces.
If there is one, I would like to see it.

Eemeli Aro · Answer 6 · Thu Feb 15 2024 16:43:15 GMT+0800 (China Standard Time)

I think it's way too late to introduce this syntax change for consideration, and I do not think we should consider this for LDML 45.

As an implementer, I'd like to note that I had no difficulty dealing with the current syntax. In a pattern, from the { you can see that you have some expression or markup starting, and then you need to look past any subsequent whitespace to determine what that is.

Mihai Nita · Answer 7 · Fri Feb 16 2024 04:04:02 GMT+0800 (China Standard Time)

As an implementer, I'd like to note that I had no difficulty dealing with the current syntax

This is not about implementing. That is point 1 of 4, and in fact the weakest one.
Should not have been the first one (or included at all, probably)

It is that it does not make sense semantically, as a user, and no other system does that.

Addison Phillips · Answer 8 · Mon Feb 19 2024 02:25:46 GMT+0800 (China Standard Time)

We agreed to the syntax for 45 in the F2F. I'm going to mark this for consideration during tech preview.

Mihai Nita · Answer 9 · Thu Feb 22 2024 13:11:52 GMT+0800 (China Standard Time)

BTW, if we are to be "loose" with the spaces, why not allow spaces between the # and identifiers?

Current:

"{" [s] "#" identifier *(s option) *(s attribute) [s] "}"

Extra space:

"{" [s] "#" [s] identifier *(s option) *(s attribute) [s] "}"

The current syntax allows ...{ /bold }... but not ...{/ bold }...
Why make it mandatory to "glue" the / and # on the identifier?

I think this is all based on a superficial (visual) similarity with the placeholders.

Luca Casonato · Answer 10 · Sun Jun 30 2024 22:42:21 GMT+0800 (China Standard Time)

I'd also just like to chime in that I agree with @mihnita - restricting the syntax here to ensure that there is no whitespace between { and # and { and /, aligning with the standalone close token (/}) would be great - it is visually much clearer than { # and { /

Addison Phillips · Answer 11 · Mon Jul 01 2024 01:03:09 GMT+0800 (China Standard Time)

I think some care should be exercised about how we discuss the whitespace here. If one just looks at the sigil, the spaces looks weird. But notice that in the remainder of our syntax, the sigil is attached to something, e.g. .keyword, :function, $variable, |literal|. So one argument would be that the # and / sigils should attach to the markup identifier:

{   #strong   }

The counter argument would seem to be that markup is a fundamentally different type of expression, so it's not really a sigil, it's a different introducing sequence:

{#   strong   }

Looking at what HTML does is not really instructive, since any HTML would be produced from the markup syntax and we need to think about what other syntax's needs are as well.

I'd be curious how markup is currently parsed by implementations? Attaching the sigil to the starter would require a one character lookahead in each expression to check if the expression is markup. Attaching the sigil to the identifier would be more similar to seeking the next token (see list above). That doesn't mean that the lookahead is evil. I'm just curious if the difference in parsing is worth it.

Eemeli Aro · Answer 12 · Mon Jul 01 2024 18:44:57 GMT+0800 (China Standard Time)

I'd be curious how markup is currently parsed by implementations?

I parse markup together with expressions. Because the constructions are syntactically so similar, it's easier to have just one handler for the stuff between curly braces.

As a user, I would find variance between the whitespace requirements of expressions and markup very confusing. I don't find the /} confusing, because it's the only such terminal syntax we have besides } and }}.

Tim Chevalier · Answer 13 · Fri Jul 19 2024 05:15:03 GMT+0800 (China Standard Time)

My parser does the lookahead after consuming the '{' -- if the next character is '#' or '/' a separate parseMarkup() function is called, which reuses the options and attributes parsing code but is separate from parseExpression(). It wouldn't cause a problem for me to discard whitespace between '#' and the identifier. I don't have a strong opinion on this, though.