unicode-org / message-format-wg

Developing a standard for localizable message strings

[FEEDBACK] The Literal 'quoted' property is obsolete

bhaible opened this issue

In the data model descriptions, there is a contradiction as to whether the Literal object contains a quoted property or not:

Because of the statement "Implementations MUST NOT distinguish between quoted and unquoted literals that have the same sequence of code points." in https://github.com/unicode-org/message-format-wg/blob/main/spec/syntax.md, and because the purpose of the data model is "to allow interchange of the logical representation", I believe this property should be removed. That is, the value property is the unquoted literal.

In other words, quoting/escaping is part of the parser's job, not part of the data model.
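A minimal TypeScript sketch of that division of labor. The `Literal` shape, the `parseLiteral` helper, the `|` delimiters, and the two escapes handled here are simplifying assumptions for illustration, not the spec's definitions.

```typescript
// Hypothetical, simplified literal shape: only the value, no `quoted` flag.
interface Literal {
  type: "literal";
  value: string;
}

// Parse a single literal token.
// Assumption: quoted literals are delimited by `|`, and `\\` and `\|`
// are the only escape sequences.
function parseLiteral(token: string): Literal {
  if (token.length >= 2 && token.startsWith("|") && token.endsWith("|")) {
    const body = token.slice(1, -1);
    // Drop the backslash of each escape and keep the escaped character.
    return { type: "literal", value: body.replace(/\\([\\|])/g, "$1") };
  }
  // Unquoted: the token text already is the value.
  return { type: "literal", value: token };
}

// Both spellings produce the same data model object.
console.log(parseLiteral("|hello|").value === parseLiteral("hello").value); // true
```

Under these assumptions, nothing downstream of the parser can tell which spelling was used in the source.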

Thanks @bhaible for the comment. I believe your feedback is correct, particularly:

In other words, quoting/escaping is part of the parser's job, not part of the data model.

I think this statement is "upside-down":

That is, the value property is the unquoted literal.

The value property is the literal. Whether it is quoted or not is a detail of the serialized messages and not visible in the data model. When re-serializing the data model, some (but not all) values can be represented as unquoted literals. It is up to the implementation as to whether the quotes are omitted or not.
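A minimal sketch of that serializer-side choice, in TypeScript. The `serializeLiteral` function, its `preferUnquoted` option, and the `UNQUOTED_SAFE` pattern are hypothetical; the pattern is a deliberately conservative ASCII stand-in for the real `unquoted` production, and the `|` delimiters with `\\` and `\|` escapes are a simplification.

```typescript
// Assumed stand-in for the `unquoted` production: an ASCII name-like token.
const UNQUOTED_SAFE = /^[A-Za-z_][A-Za-z0-9_.-]*$/;

// Serialize a literal value. Only the value itself is inspected when
// deciding whether the quotes may be omitted.
function serializeLiteral(value: string, preferUnquoted: boolean): string {
  if (preferUnquoted && UNQUOTED_SAFE.test(value)) {
    return value;
  }
  // Escape backslashes and pipes, then wrap the result in `|...|`.
  const escaped = value.replace(/[\\|]/g, "\\$&");
  return `|${escaped}|`;
}

console.log(serializeLiteral("hello", true));  // hello
console.log(serializeLiteral("hello", false)); // |hello|
console.log(serializeLiteral("a b", true));    // |a b|
```

Either output for `hello` is acceptable here, because both parse back to the same value.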

Sorry to reopen, I didn't see this in time.

Removing the "quoted" metadata from the data model makes it harder to test the round-trip property. As it was, you can "normalize" the input string in a context-free way (delete all optional whitespace) and test that against the output of the serializer. Removing unnecessary quotes requires context. It's not that it's impossible, but in my opinion, anything except whitespace should be preserved in the data model so that a serializer can use it to reproduce the same message as the one that was parsed (modulo whitespace).

anything except whitespace should be preserved in the data model so that a serializer can use it to reproduce the same message as the one that was parsed (modulo whitespace)

I don't agree. The quotes around a literal that matches unquoted are optional. We say quite explicitly that there is no difference between a quoted and unquoted literal. An implementation is therefore allowed to quote any unquoted literal (and this probably should be considered the "canonical" form).

Removing unnecessary quotes requires context.

In what way? The production unquoted only appears in the production literal. Anywhere a literal can appear, an unquoted can appear (sans quotes), but can be replaced with a quoted literal too. The only thing an implementation has to check is: in order to put it unquoted, the contents of the literal (and nothing else) have to be checked.

Also, note that we allow other changes when round-tripping. For example, I don't think we guarantee the order of the variants.

In what way? The production unquoted only appears in the production literal. Anywhere a literal can appear, an unquoted can appear (sans quotes), but can be replaced with a quoted literal too. The only thing an implementation has to check is: in order to put it unquoted, the contents of the literal (and nothing else) have to be checked.

"Context" was probably bad phrasing on my part, since removing whitespace also requires context (knowing if you're currently parsing an s or an [s].)

I would probably be less uneasy if there were a "canonicalization" algorithm for message strings as part of the spec.
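No such algorithm is defined in this thread, but at the level of a single literal it could look roughly like the TypeScript sketch below, which takes "always quoted" as the canonical form. The function names are hypothetical, and the `|` delimiters with `\\` and `\|` as the only escapes are simplifying assumptions.

```typescript
// Decode a literal token to its value.
// Assumption: `|` delimiters, with `\\` and `\|` as the only escapes.
function literalValue(token: string): string {
  const quoted =
    token.length >= 2 && token.startsWith("|") && token.endsWith("|");
  return quoted ? token.slice(1, -1).replace(/\\([\\|])/g, "$1") : token;
}

// Canonical spelling: always quoted, with backslashes and pipes escaped.
function canonicalLiteral(token: string): string {
  return `|${literalValue(token).replace(/[\\|]/g, "\\$&")}|`;
}

// Round-trip check that ignores whether the quotes were optional.
function sameLiteral(input: string, output: string): boolean {
  return canonicalLiteral(input) === canonicalLiteral(output);
}

console.log(canonicalLiteral("hello"));        // |hello|
console.log(sameLiteral("hello", "|hello|"));  // true
console.log(sameLiteral("hello", "|hullo|"));  // false
```

A round-trip test built this way canonicalizes both the parsed input and the serializer output before comparing them, so it does not need a `quoted` flag in the data model.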

Also, note that we allow other changes when round-tripping. For example, I don't think we guarantee the order of the variants.

That's not a problem for round-tripping since the implementation is free to sort the variant lists so as to preserve the ordering. But losing information is.

I think the quotes aren't information? Let me put it a different way: the quotes do not affect anything that might happen to a literal in a data model.

Notice that we do not preserve the {{ and }} for patterns in the data model. The pattern quotes don't appear in simple messages, but the data model doesn't distinguish between the simple message Hello world and the complex message {{Hello world}}.
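The same point as a small TypeScript sketch. The `Pattern` shape and `parseTextOnlyMessage` are hypothetical and cover text-only messages only: no placeholders, declarations, or escapes are handled.

```typescript
// Hypothetical, text-only pattern model: the delimiters are not recorded.
interface Pattern {
  parts: string[];
}

// Assumption: the input is either a simple message or a single quoted
// pattern `{{...}}` containing only text.
function parseTextOnlyMessage(source: string): Pattern {
  const trimmed = source.trim();
  const isQuotedPattern =
    trimmed.length >= 4 && trimmed.startsWith("{{") && trimmed.endsWith("}}");
  // Strip the pattern quotes if present; otherwise keep the text as-is.
  return { parts: [isQuotedPattern ? trimmed.slice(2, -2) : source] };
}

const simple = parseTextOnlyMessage("Hello world");
const complex = parseTextOnlyMessage("{{Hello world}}");
console.log(JSON.stringify(simple) === JSON.stringify(complex)); // true
```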

Okay, I think I'm going to have to try updating my implementation to the current syntax and data model before I can say more with certainty. Closing for now.