unicode-org / message-format-wg

Developing a standard for localizable message strings

Question about grammatical case

macchiati opened this issue · comments

I want to come up with a good example that shows the use of case inflections. (I realize that is not in scope for the MVP coming up, and don't want to derail that! This would be for the future).

Here's the example:

.match {$userGender :gender} {$count :integer} {$sourceCity}
female one {{Benvenuta, {$count} packages have arrived for you from {$sourceCity case=genitive}}}
female * {{Benvenuta, {$count} packages have arrived for you from {$sourceCity case=genitive}}}
* one {{Benvenuto, {$count} packages have arrived for you from {$sourceCity case=genitive}}}
* * {{Benvenuto, {$count} packages have arrived for you from {$sourceCity case=genitive}}}

This message is artificial for readability, but represents real grammatical features.

  • “Welcome” has two forms in Italian, based on the gender of the listener/reader: Benvenuta for females, Benvenuto for males. In fact, the "you" would also change form based on the gender of the listener/reader in languages like Arabic or Hebrew.
  • “London” (nominative case) becomes “Londona” (genitive case) after “iz” (= “from”) in Serbian.

And here's the question/comment:

There are some interesting features about grammatical case.

  1. Grammatical case can potentially be applied to any placeholder: strings, names, dates, currency values, etc. This could be done by software outside of the particular function being used.
  2. Unlike other options, the translation tooling should allow the case to be changed by the translator to suit the language: genitive for some languages, ablative for others, etc.

Is this something that was considered already, where I can point people to an issue discussing it?

Have a look at #450 (expression attributes), which is a potential mechanism that we might use for this. We could use options, as you show in your example patterns, but that requires every function to declare the relevant options and their values. Expression attributes, since they are built-in, might work better. They also can be excised or omitted when a language doesn't need them.

There has been some discussion of this problem in various issues, mostly set out-of-scope for this release. Grammatical case has some of the complexities that you mention. It's related to the "bone dragon" problem and is something that LLM-type solutions can do more readily in a scalable way than static patterns. In some cases, the gender, count, etc. of the placeholder's contents affect static words in the pattern, so the combinatorial nature of the solution expands rapidly.

It will also be a problem in that developers will have no clue how to use something like this reliably.

Agreed that LLMs can be an ultimate solution, and would certainly be for any really complicated messages. But there are limitations to their usage also (otherwise we wouldn't be doing MF2). So we are looking at the class of relatively straightforward messages that can be handled reasonably well by MF2, when enhanced by an inflection engine.

Totally. The challenge here is that a one-off solution for a controlled situation is relatively straightforward to build. But a general-purpose solution is elusive. Some languages don't need this at all. Others have complex interword dependencies. Also, the gender and case types don't always match up between languages. I want there to be a solution for the "less than LLM" needs.

Anyway, this will be great to take up post-LDML45. In the meantime, in Monday's call we need to discuss expression attributes. I think these might be an important building block for eventual support of grammar handling.

I think annotations/attributes could be the answer too. I don't want at all to derail 45, so what I'm thinking is just to get the syntax in, reserving the interpretation for later; we can experiment with it during the tech preview stage. Something like:

annotation = (function *(s option))
→
annotation = (function *(s option) *(s attribute))
…
attribute = "@" identifier [s] "=" [s] (literal / variable) ; attributes are reserved for experimental use in tech preview

I agree that handling inflections is tricky, but I think we can make some substantial headway, bearing (strongly) in mind that Il meglio è l'inimico del bene ("the best is the enemy of the good").

As a speaker of a language with grammatical inflection for nouns, adjectives, pronouns, and numbers, being able to express grammatical case and other properties is very dear to my heart.

I don't think expression attributes are the right answer. First, I think they're at risk right now. Second, they're more likely to be oriented towards tooling than runtime. Third, at runtime, functions should be enough to encode grammatical inflection and agreement. This is why I don't see why we'd put this out of scope. All the parts required to make this work are already in place.

In the message above, I'd imagine a new custom function, let's call it :geoName. This function has different signatures for different locales; for Serbian it takes a case option:

.match {$userGender :gender} {$count :integer} {$sourceCity}
female one {{Benvenuta, {$count} packages have arrived for you from {$sourceCity :geoName case=genitive}}}
female * {{Benvenuta, {$count} packages have arrived for you from {$sourceCity :geoName case=genitive}}}
* one {{Benvenuto, {$count} packages have arrived for you from {$sourceCity :geoName case=genitive}}}
* * {{Benvenuto, {$count} packages have arrived for you from {$sourceCity :geoName case=genitive}}}

How the function is implemented at runtime is not specified. It can be a dictionary lookup, some logic expressed in code, or an API call to an LLM. Importantly, the grammatical information is encoded in the message's AST: in Serbian, the city name must be in genitive.
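To make the dictionary-lookup option concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the name `geo_name`, the `INFLECTED_FORMS` table, and the choice to return the input unchanged when no form is known are not part of any MF2 registry or implementation. The only data point is the Serbian London/Londona pair mentioned earlier in this thread.

```python
# Hypothetical lookup table: (locale, case) -> {base form: inflected form}.
# Only the Serbian genitive of "London" comes from this thread.
INFLECTED_FORMS = {
    ("sr", "genitive"): {"London": "Londona"},
}


def geo_name(value: str, locale: str, case: str = "nominative") -> str:
    """Return `value` inflected for `case`, or unchanged if no form is known."""
    if case == "nominative":
        return value
    forms = INFLECTED_FORMS.get((locale, case), {})
    return forms.get(value, value)


# Serbian has an entry for the genitive; Italian has none, so the
# name passes through untouched.
print(geo_name("London", "sr", case="genitive"))
print(geo_name("London", "it", case="genitive"))
```

A real implementation could swap the table for rules in code or a remote call without changing the message, which is the point being made above.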


We need the registry to know about the different options in inflected languages because things can get pretty complex rather quickly. On top of that, we need functions to be able to inspect the resolved values of other expressions. It's my understanding that both of these features are in scope for LDML45, with the

Consider the sentence "You have 2 red crayons.", in which both the color and the object are parameterized:

.input {$count :number}
.local $obj = {$object :noun count=$count}
{{You have {$count} {$color} {$obj}.}}

In certain inflected languages, the color needs to agree with the object in grammatical gender, number, and case. Note that while we explicitly know what the number is (the $count param), the case has to be set by the translator, as dictated by the grammar of the target language:

.local $obj = {$object :noun case=accusative count=$count}
{{You have {$color :adjective case=accusative count=$count} {$obj}.}}

However, that's still not enough: the adjective now agrees on case and number, but it also needs to agree on the grammatical gender of the object — which isn't available in this message. In fact, it's unknown until $obj is evaluated. The only way to know the gender is to inspect $obj after it's been resolved:

.local $obj = {$object :noun case=accusative count=$count}
{{You have {$color :adjective accord=$obj} {$obj}.}}

Since the shape of resolved values is implementation-specific, this won't be possible in all implementations, but it will be possible in some of them.
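Here is a rough Python sketch of what "inspecting `$obj` after it's been resolved" could look like in an implementation that exposes grammatical metadata on resolved values. The `ResolvedNoun` shape, the `noun` and `adjective` helpers, and the tiny lexicon are all assumptions made for this example. The data uses a handful of Polish accusative forms as a toy lexicon and is deliberately simplified: it only covers counts 1 to 4, since larger counts need other forms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedNoun:
    """Hypothetical resolved value: formatted text plus grammatical metadata."""
    text: str
    gender: str
    number: str
    case: str


# Toy lexicon (assumed): accusative forms of two Polish nouns.
NOUNS = {
    "crayon": {"gender": "feminine", "singular": "kredkę", "plural": "kredki"},
    "pencil": {"gender": "masculine", "singular": "ołówek", "plural": "ołówki"},
}

# Forms of "red", keyed by (case, gender, number).
ADJECTIVES = {
    "red": {
        ("accusative", "feminine", "singular"): "czerwoną",
        ("accusative", "masculine", "singular"): "czerwony",
        ("accusative", "feminine", "plural"): "czerwone",
        ("accusative", "masculine", "plural"): "czerwone",
    },
}


def noun(key: str, case: str, count: int) -> ResolvedNoun:
    """Resolve a noun; its gender only becomes known at this point."""
    entry = NOUNS[key]
    number = "singular" if count == 1 else "plural"  # simplified: counts 1 to 4
    return ResolvedNoun(entry[number], entry["gender"], number, case)


def adjective(key: str, accord: ResolvedNoun) -> str:
    """Pick the adjective form by reading metadata off the resolved noun."""
    return ADJECTIVES[key][(accord.case, accord.gender, accord.number)]


def render(color: str, obj_key: str, count: int) -> str:
    obj = noun(obj_key, case="accusative", count=count)
    return f"Masz {count} {adjective(color, accord=obj)} {obj.text}."


print(render("red", "crayon", 2))
print(render("red", "pencil", 1))
```

The message author never states the gender; `adjective` learns it from the value that `noun` produced, which is why the resolved value has to carry more than a string.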

Expression attributes may not be the right solution, and I have no objection to holding off on them. During the tech preview phase, we can do experimentation with options instead. So I don't want to deep-end on this in the near-term meetings, because we have some hard deadlines coming up.


As to your broader point, we have to recognize that languages can get very complex, and our goal is to deal with messages that translators can relatively easily handle. I don't think we can expect MF2.0+ to be arbitrarily powerful in generating messages; that leads to very complex structures like those used in Google Assistant, ones that will probably eventually be replaced by LLMs.

I think the first focus for MF2.0 will be to make the kinds of messages that people generate right now and expect to use in the next decade. Those are messages with placeholders that are predominantly noun-phrases plus generated variables (numbers, dates, times, currencies, ...). You will find few if any that combine, for example, {$color} {$obj} together, where $color is an arbitrary color and $obj is an arbitrary object. Nor will you find placeholders very often being verbs; that gets very complicated very quickly.

In the vast majority of cases, we will start with a source language (often English, but not necessarily). The translation software will typically expand out the variant messages in accordance with the possible options for the particular values in that language (2 plurals in English going to 6 in Arabic).
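The expansion step can be sketched in a few lines of Python. This is only an illustration of the combinatorics, not how any particular translation tool works: the helper name `expand_variant_keys` and the two-value gender list are assumptions, while the plural category lists are the CLDR cardinal categories for English and Arabic.

```python
from itertools import product

# CLDR cardinal plural categories for the two locales mentioned above.
PLURAL_CATEGORIES = {
    "en": ["one", "other"],
    "ar": ["zero", "one", "two", "few", "many", "other"],
}

# Assumed gender keys, mirroring the "female" / "*" variants used in this thread.
GENDER_KEYS = ["female", "*"]


def expand_variant_keys(selector_categories):
    """Return every key tuple that needs a pattern.

    `selector_categories` holds one list of categories per selector.
    """
    return list(product(*selector_categories))


en_keys = expand_variant_keys([GENDER_KEYS, PLURAL_CATEGORIES["en"]])
ar_keys = expand_variant_keys([GENDER_KEYS, PLURAL_CATEGORIES["ar"]])

# 2 genders x 2 plural categories versus 2 genders x 6 plural categories.
print(len(en_keys), len(ar_keys))
```

Each extra selector multiplies the number of variants, which is the combinatorial growth the thread keeps coming back to.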

English source, for example, will not be marked up with case options; those will need to be added, and almost any noun-phrase placeholder in an inflected language could need them. So as well as your proposed :geoName, we'll also need case options for dates, times, date intervals, currency values, and even numbers when a spell-out format is specified. The placeholders that will not need them are probably far fewer. That's why I was thinking that the @attribute mechanism might apply, whereby the annotation information is accessible to the function, if it can handle it.

But it also puts a big burden on all of these functions to handle grammatical case. We also need to consider the model where a general-purpose inflection engine is available, such as one that can take a placeholder noun phrase and change it to reflect a particular case. It might well be more efficient to incorporate such an engine into an MF2.0 implementation at the top level, without requiring all functions to handle grammatical case.

That brings up an important point. If a case option or attribute is supplied to a function, the function should be able to tell the MF2.0 engine whether or not it handled that option/attribute. The engine could then apply a general inflection engine if and only if the function did not, preventing double inflection.
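One way this handshake could look, as a minimal Python sketch. None of these names exist in MF2 or in any implementation I know of: `FormatResult`, `case_applied`, `format_placeholder`, and the stub inflection engine are all hypothetical, and the stub just tags the text so the fallback path is visible.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FormatResult:
    text: str
    case_applied: bool  # True if the function already inflected the text


def case_aware_city(value: str, case: Optional[str]) -> FormatResult:
    """A function that handles case itself when it knows the form."""
    forms = {("London", "genitive"): "Londona"}
    inflected = forms.get((value, case))
    if inflected is None:
        return FormatResult(value, case_applied=False)
    return FormatResult(inflected, case_applied=True)


def case_blind(value: str, case: Optional[str]) -> FormatResult:
    """A function that ignores the case request entirely."""
    return FormatResult(str(value), case_applied=False)


def stub_engine(text: str, case: str) -> str:
    """Stand-in for a general inflection engine; it only tags the text."""
    return f"{text}[{case}]"


def format_placeholder(
    value: str,
    function: Callable[[str, Optional[str]], FormatResult],
    case: Optional[str],
    fallback_inflect: Callable[[str, str], str],
) -> str:
    """Call the engine only when a case was requested and not yet applied."""
    result = function(value, case)
    if case is None or result.case_applied:
        return result.text
    return fallback_inflect(result.text, case)


print(format_placeholder("London", case_aware_city, "genitive", stub_engine))
print(format_placeholder("London", case_blind, "genitive", stub_engine))
```

The first call is inflected by the function and never reaches the engine; the second falls through to it. Because the flag travels with the result, the text is never inflected twice.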