unicode-org / message-format-wg

Developing a standard for localizable message strings

How to refer to CLDR data on plural rules?

eemeli opened this issue

As discussed on #534 (comment) and previously, it would be Very Good to be able to make use of the existing CLDR data at least in plurals.xml and ordinals.xml, which effectively describe which locales use which plural categories, and how those are selected. The structure of these files is described in ldmlSupplemental.dtd.

As far as I'm aware, only information for plural and ordinal categories is available with this specific format. For example, the CLDR grammaticalFeatures.xml file has a rather different structure for its presentation of other grammatical features, which may become useful for other formatters and selectors.

As @aphillips notes, "Perhaps we should have a referencing mechanism to CLDR instead of replicating data [for plural matching]."

However, this isn't entirely straightforward, as evidenced by the fact that it hasn't been done yet. In terms of what's theoretically achievable here, we have the following capability levels:

  0. We can just go with the full set of categories with a <match values="zero one two few many other">, which does not require any additional data. We'll need to provide this baseline in any case; everything else filters this to some subset.
  1. If we can make the <plurals type="..."><pluralRules locales="..."><pluralRule count="..."> attribute information available to registry users, they can determine that, given a type (cardinal or ordinal) and a locale code, the count attributes of the matching set of <pluralRule> elements define the available categories.
  2. If we can parse and process the contents of the <pluralRule> elements, we can further restrict the categories in many cases. For example, we could determine that in English, a numeric selector with minimumFractionDigits=2 will only ever resolve to the other category, or that in Polish an :integer plural selector would only match one, few, or many, and never other.
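
To make the attribute-based lookup above concrete, here is a minimal Python sketch of finding the available categories for a given type and locale. The embedded XML is an abbreviated, hand-written excerpt in the shape of plurals.xml, not the real file; the function name plural_categories and the fallback to the full category set for an unlisted locale are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written excerpt shaped like CLDR's plurals.xml (assumption:
# rule texts and samples are shortened; the real file covers many more locales).
SAMPLE_PLURALS_XML = """\
<supplementalData>
  <plurals type="cardinal">
    <pluralRules locales="en">
      <pluralRule count="one">i = 1 and v = 0 @integer 1</pluralRule>
      <pluralRule count="other"> @integer 0, 2~16 @decimal 0.0~1.5</pluralRule>
    </pluralRules>
    <pluralRules locales="pl">
      <pluralRule count="one">i = 1 and v = 0 @integer 1</pluralRule>
      <pluralRule count="few">v = 0 and i % 10 = 2..4 and i % 100 != 12..14 @integer 2~4, 22~24</pluralRule>
      <pluralRule count="many">v = 0 and i != 1 and i % 10 = 0..1 or v = 0 and i % 10 = 5..9 or v = 0 and i % 100 = 12..14 @integer 0, 5~19</pluralRule>
      <pluralRule count="other"> @decimal 0.0~1.5</pluralRule>
    </pluralRules>
  </plurals>
</supplementalData>
"""

# The baseline set that needs no additional data.
ALL_CATEGORIES = ["zero", "one", "two", "few", "many", "other"]


def plural_categories(xml_text, plural_type, locale):
    """Return the count values listed for a locale under the given plurals type."""
    root = ET.fromstring(xml_text)
    for plurals in root.iter("plurals"):
        if plurals.get("type") != plural_type:
            continue
        for rules in plurals.iter("pluralRules"):
            # The locales attribute is a space-separated list of locale codes.
            if locale in rules.get("locales", "").split():
                return [rule.get("count") for rule in rules.iter("pluralRule")]
    # Hypothetical fallback: use the unfiltered baseline when no entry is found.
    return list(ALL_CATEGORIES)


print(plural_categories(SAMPLE_PLURALS_XML, "cardinal", "pl"))
```

Only the type, locales, and count attributes are read here; the rule text inside each <pluralRule> is ignored, which is why this kind of lookup cannot narrow the set any further on its own.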

In order for us to go beyond Level 0 in the core registry definition without actually duplicating data, I think we would need something like an XSL Transform specifically for plural and ordinal data, and a referencing style where we could say something like (syntax only indicative):

<matchRef href="path/to/plurals.xml" transform="path/to/plural-match-mapper.xsl"/>

I'm not sure that there's a reasonable way to extract the @integer / @decimal -ness of the rules with XSLT, to allow for reaching Level 2.
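
Outside of XSLT, a rough way to read the @integer / @decimal markers is a plain substring check on the rule text. The Python sketch below is not a proposal for the registry mechanism; it is a heuristic under stated assumptions: the XML is an abbreviated, hand-written excerpt shaped like the Polish cardinal entry, and the helper name categories_with_sample is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written excerpt shaped like one <pluralRules> element
# (assumption: rule texts and sample lists are shortened for illustration).
SAMPLE_PL_RULES_XML = """\
<pluralRules locales="pl">
  <pluralRule count="one">i = 1 and v = 0 @integer 1</pluralRule>
  <pluralRule count="few">v = 0 and i % 10 = 2..4 and i % 100 != 12..14 @integer 2~4, 22~24</pluralRule>
  <pluralRule count="many">v = 0 and i % 10 = 5..9 or v = 0 and i % 100 = 12..14 @integer 0, 5~19</pluralRule>
  <pluralRule count="other"> @decimal 0.0~1.5</pluralRule>
</pluralRules>
"""


def categories_with_sample(rules_xml, marker):
    """Return the count values whose rule text contains the given sample marker."""
    rules = ET.fromstring(rules_xml)
    return [
        rule.get("count")
        for rule in rules.iter("pluralRule")
        if marker in (rule.text or "")
    ]


# Categories that list integer samples, and those that list decimal samples.
print(categories_with_sample(SAMPLE_PL_RULES_XML, "@integer"))
print(categories_with_sample(SAMPLE_PL_RULES_XML, "@decimal"))
```

This relies on the sample lists being present and representative, and it does not evaluate the rule conditions themselves, so it cannot account for specific option values such as a particular minimumFractionDigits.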

Without the transform itself, we could also leave a SHOULD-ish statement in the description of numerical selectors for tool builders to narrow the full set using CLDR data where appropriate.


We should decide how much of this we consider to be within scope of the core registry, and how much we intend to get done for next Spring's release.

I think that a minimum set for the spring release would be a list of functions and their options and, in the case of selectors, a general description of available matching keys. For plural selection, for example, this would be anyNumber and a list of the CLDR keywords (zero, one, two, few, many, other/*), but not the locale-based tailoring of these.
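
As a rough illustration of what such a general description of matching keys could mean for a tool, the Python sketch below accepts a key if it is a number literal, one of the six CLDR keywords, or the catch-all *. The function name is_valid_plural_key and the exact number-literal pattern are assumptions for illustration; the registry does not define them here.

```python
import re

# The six CLDR plural keywords, with no locale-based tailoring applied.
CLDR_PLURAL_KEYWORDS = frozenset({"zero", "one", "two", "few", "many", "other"})

# Assumed shape of a number literal key: optional sign, integer part,
# optional fraction. This pattern is illustrative only.
_NUMBER_LITERAL = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?")


def is_valid_plural_key(key):
    """Return True if key is '*', a CLDR plural keyword, or a number literal."""
    if key == "*" or key in CLDR_PLURAL_KEYWORDS:
        return True
    return _NUMBER_LITERAL.fullmatch(key) is not None


for key in ["one", "42", "1.5", "*", "several"]:
    print(key, is_valid_plural_key(key))
```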

As @eemeli notes, ideally we would establish a link to CLDR data where that's appropriate. If a transform were needed strictly for MF purposes, maybe that could be produced?

A possible solution to the titular question here is presented in #558 (comment), including a proof of concept parametric XSLT transform.

I think this can be moved to Future as the actual format of the registry will be post-45? I still want to solve this, but looking to manage scope.