Some red flags with the current grammar for reserved-statement

Question

Some red flags with the current grammar for reserved-statement

mihnita opened this issue 5 months ago · comments

Reserved statements:

reserved-statement = reserved-keyword [s reserved-body] 1*([s] expression)
reserved-keyword   = "." name
reserved-body      = *([s] 1*(reserved-char / reserved-escape / quoted))
reserved-char     = content-char / "."

If I read this correctly, then these strings are valid:

// reserved-keyword   reserved-body   expression
   .foo
   .foo               .               {$baz}
   .foo               ....            {$baz}
   .foo               .bar            {$baz}

Do we really want to allow '.' in the reserved body?

Escaped text:

text-escape     = backslash ( backslash / "{" / "}" )
quoted-escape   = backslash ( backslash / "|" )
reserved-escape = backslash ( backslash / "{" / "|" / "}" )

To me this looks like a red flag.
Not only for the implementers of a parser, but also for the users of the final syntax.

In most cases escaped strings can be handled pretty low level (tokenizer).

But in this case knowing that we can now accept reserved-escape we need a lot of context.
And that allows us to support strings like this:

.foo .\{\|\} {$baz}

So I find the reserved-statement to be too lax.

Or I am reading the grammar wrong?

Addison Phillips · Answer 1 · Wed Feb 14 2024 08:13:47 GMT+0800 (China Standard Time)

I think you're reading the grammar fine, but drawing an perhaps-incorrect conclusion.

So I find the reserved-statement to be too lax.

Too lax for what purpose? We deliberately made the grammar of reserved-body permissive, knowing that almost certainly rational people will not use it for weird things such as your example.

The { must be quoted so that we can recognize expression start. The | is quoted to allow quoted to work in reserved-body.

Basically this says "consume characters until you see an (unescaped) {, which starts the required expression sequence. If anything, reserved-body is too strict, in that it requires | to be paired (for quoted).

How would you change it? We need to freeze the syntax.

Addison Phillips · Answer 2 · Mon Feb 19 2024 02:23:57 GMT+0800 (China Standard Time)

@mihnita I'm sorry I missed bringing this up in the F2F call. Can you clarify the request or, if appropriate, close this issues if it has been addressed? Thanks.

Addison Phillips · Answer 3 · Wed Feb 28 2024 23:26:29 GMT+0800 (China Standard Time)

Per discussion in 2024-02-26 call, closing

Mihai Nita · Answer 4 · Tue Mar 05 2024 01:09:12 GMT+0800 (China Standard Time)

This is how our escaping of { | } looks like in different contexts:

Normal | \{ \}, but {  |quoted \| { }| &reserved \| \{ \} but |quoted inside! \| { } | out, so \| \{ \}} back to normal.
11111111111111111111 22 3333333333333 2222222222 444444444444  5555555555555555555555  4444444444444444 1111111111111111

Legend:
1: normal text (text-char = content-char / s / "." / "@" / "|")
2: inside a placeholder
3: quoted literal (quoted-char = content-char / s / "." / "@" / "{" / "}")
4: reserved (reserved-char = content-char / ".")
5: quoted literal inside a reserved

When writing a parser it is common to write a tokenizer that groups characters into tokens, and then the parser works at the higher level, on tokens.

But to write a tokenizer it is best to not need much context, just consume characters.

On occasions the tokenizer might enter a different mode, for example in strings ("...\"..."), block comment (/*...*/) or line comment (//...).

But these are one single level mode changes.
When I enter string mode, we stay there (with \") all the way to the last ".
We don't look for // or /* in a string.
Same, when we are in block comment mode we don't care about \" or //. And same for line comments.

For our grammar the tokenizer must know when we are inside a reserved section, but to get there it needs a { (to ender placeholder mode), then a &foo to enter reserved mode.
So it has to look for relatively high level information (concepts like "placeholder"), that belong at the parser level.

Worse, inside a reserve sequence we are still looking (and recognizing) a literal (block 5 above).
We need to know we are in a literal, in a reserved, in a placeholder.

This is unnecessary complex.
Most programming languages don't need such a tokenizer.

Our grammar forces the tokenizer to:

go deep
require context from parser

"I wrote a parser for it" is not an argument.
People have written parsers for bad grammars before.
Being able to write a parser does not mean the grammar is good.

And think what this does for our users.
They must understand all of these rules, when to escape what.

Mihai Nita · Answer 5 · Tue Mar 05 2024 01:11:59 GMT+0800 (China Standard Time)

Per discussion in 2024-02-26 call, closing

I think this was a misunderstanding.
I though that we drop it as in "not a blocker for LDML 45"
Not close it completely.

Mihai Nita · Answer 6 · Tue Mar 05 2024 01:21:01 GMT+0800 (China Standard Time)

Kind of related, I think that there is no way to use @ inside a reserved body:

reserved-char   = content-char / "."
reserved-escape = backslash ( backslash / "{" / "|" / "}" )

Bruno Haible · Answer 7 · Wed Mar 20 2024 23:36:31 GMT+0800 (China Standard Time)

When writing a parser

You mean writing a left-to-right (from the beginning to the end) parser, right?

I will concede that for our grammar, writing a parser that starts somewhere in the middle and needs to understand the context by looking left, is not so immediate as writing a left-to-right parser. But that is probably true for many (most) language grammars.

it is common to write a tokenizer that groups characters into tokens, and then the parser works at the higher level, on tokens.

There are two ways to write a parser:

It can take lexer/tokenizers at different levels. For instance, the parser may invoke a 1-character-at-a-time function in some places (such as inside a quoted) and a 1-token-at-a-time function in other places (such as in normal text). In an ISO C parser, at least 4 levels are useful (due to backslash-newline, trigraphs, \u escapes, and such). Or
It can work with a single lexer/tokenizer function across the entire grammar, but this function needs some context. In our grammar, it is possible to use a lexer/tokenizer function (as the basis of an LALR(1) parser) that
- produces individual characters and tokens for "{{", "input", "local", "match",
- receives 1 bit of context, that enables tokenizing everywhere except inside a text-escape or a reserved-escape.

But to write a tokenizer it is best to not need much context, just consume characters.

Yes. As said above, this can be achieved with 1 bit of context.

On occasions the tokenizer might enter a different mode, for example in strings ("..."..."), block comment (/.../) or line comment (//...).

But these are one single level mode changes.
When I enter string mode, we stay there (with ") all the way to the last ".
We don't look for // or /* in a string.
Same, when we are in block comment mode we don't care about " or //. And same for line comments.

For our grammar the tokenizer must know when we are inside a reserved section, but to get there it needs a { (to ender placeholder mode), then a &foo to enter reserved mode.
So it has to look for relatively high level information (concepts like "placeholder"), that belong at the parser level.

I disagree. As stated above, all the tokenizer needs to know is whether the parser currently is inside a text-escape or a reserved-escape. It does not need to know about placeholder, it does not need to know about quoted, etc.

Worse, inside a reserve sequence we are still looking (and recognizing) a literal (block 5 above).
We need to know we are in a literal, in a reserved, in a placeholder.

There is no problem with literals in the grammar. Maybe you designed a tokenizer that attempts to returns literals as a single token? If so, that is a problem with the design of that tokenizer, but not a problem with the grammar.

I mean, you would have the same problem in an ISO C parser, if you wrote a tokenizer that returns -5 as a single token. Then you would have problems with the unary-minus operator: return - 5; and return -5; would parse differently. The solution here would be to modify the tokenizer so that it returns - and 5 separately. In summary: To fit a certain grammar, you need to adapt the tokenizer appropriately.

Bruno Haible · Answer 8 · Wed Mar 20 2024 23:42:36 GMT+0800 (China Standard Time)

I think that there is no way to use @ inside a reserved body:

I think this is intentional. Because otherwise there would be a parsing ambiguity inside an annotation-expression:

{ %foo @x }

could be parsed as

'foo @x' as reserved-body
no attributes

or as

'foo' as reserved-body
'@x' as attribute

Addison Phillips · Answer 9 · Wed Mar 20 2024 23:56:16 GMT+0800 (China Standard Time)

I think that there is no way to use @ inside a reserved body:
I think this is intentional.

This is 100% intentional. A reserved annotation is part of annotation and needs to behave like any other annotation. This includes not subsuming other constructs, such as attribute.