shexSpec / shex

ShEx language issues, including new features for e.g. ShEx2.1

language stem should respect langMatches semantics

VladimirAlexiev opened this issue

The following shape:
:SpanishProduct { schema:label [ @es~ ] }
Declares that products must have a label in Spanish or any variant of it (eg es-ES vs es-AR).

But LanguageStem is defined as a simple prefix match:

s is a LanguageStem and n is a language-tagged string with a language tag l
and fn:starts-with(l, st)

It has these defects:

  • it will match language "Carro"@ese where ese is Ese Ejja, and I don't think those people got cars ;-)
  • it won't match "Carro"@ES but lang tags are defined to be case-insensitive.
  • (the definition says st where it should refer to s)

Instead of a simple prefix match, it should comply with langMatches semantics. RFC4647 defines tags for lang, script, dialect, region etc., and says that matching is case-insensitive. Assuming s doesn't end in - and assuming . represents concat, it can be defined e.g. like:
regex (l, "(^".s."$)|(^".s."-)", "i")
Note: a simpler regex would be "^".s."($|-)" but I don't believe the last part of it is valid.
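
To make the defects and the proposed fix concrete, here is a minimal Python sketch. The function names `prefix_match` and `stem_match` are made up for illustration and are not part of ShEx or any ShEx implementation; `stem_match` simply transcribes the regex proposed above, with the stem escaped so it is treated literally.

```python
import re


def prefix_match(tag: str, stem: str) -> bool:
    """Simple prefix match, i.e. fn:starts-with(l, s)."""
    return tag.startswith(stem)


def stem_match(tag: str, stem: str) -> bool:
    """Proposed match: the whole tag equals the stem, or the stem is
    followed by '-', compared case-insensitively.
    Assumes the stem does not end in '-'."""
    s = re.escape(stem)  # treat the stem as literal text
    pattern = "(^" + s + "$)|(^" + s + "-)"
    return re.search(pattern, tag, re.IGNORECASE) is not None


if __name__ == "__main__":
    # Compare both definitions on the tags discussed in this issue.
    for tag in ["es", "es-AR", "ese", "ES"]:
        print(tag, prefix_match(tag, "es"), stem_match(tag, "es"))
```

With the stem es, the prefix match accepts ese and rejects ES, while the proposed match does the opposite on both.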

Aside: is a bit unreadable. The script turns it into this more readable google sheet

TEST: @ericprud gave this example URL. For me, it doesn't load the test on first load (or control-shift-R) but loads it on second refresh (control-R):

Resolved with 20170915 meeting

Resolution: change language tag matching to follow RFC4647 per

voted by: Andra, Kat, ericP, tom

Need feedback from @VladimirAlexiev on spec changes and tests before closing. Note that the issue demo fails on master (<shouldFail> passes because the test doesn't respect rfc4647) but passes on the LanguageStem-rfc4647 branch.

Spec sounds good, I like the ref to Maybe say that * is not allowed, and what happens if I give an incomplete lang tag eg @e~ (answer: won't match any value).

Tests look correct, but:

  • feel a bit uncomfortable about using unregistered sublang tags like @fr-bel
  • Maybe do some case variation (the matching should be case-insensitive)

Cheers @ericprud !

I was going to do a separate PR to add "*" to the grammar a la

[55] languageRange ::= (LANGTAG | '*') ('~' languageExclusion*)?

I tried to find two region codes where one was a substring of the other. Do you know where I can find the canonical list of regions? I picked a valid three-letter ISO region code ("bel"). I guess I could switch from FR to DE and use the example from RFC4647 basic match.

Re: case variation, true. Early on, I had data files like spo@fr.ttl and spo@FR.ttl but I think some case-insensitive file system ate them long ago. Will re-add tests for that and for shex files matching @FR, . - ~@FR and @FR~ - ~FR-BE.

Regions: and filter by type=region.
These are 2-letter country codes and 3-digit continent-like codes, so no region code is a substring of another.

But even if there were, the matching would still be the same: the stem must be followed by a dash or the end of the string. I.e. @en-G~ will match neither @en-GB nor @en-GR.

What do you want with *? Eg @*-GB to match any language spoken in Great Britain?

!!!!! Because Cyrl is the default script for ru, ru is the same as ru-Cyrl. This means that ru-RU~ should match ru-Cyrl-RU. My oh my.

And the star would add more complications

Re case sensitivity, I varied the case in the data and the schema. The latter raised a round-tripping issue to RDF. I invite you to review those PRs.

It is our belief that the semantics in ShEx 2.1 § 5.4.6 Values Constraint address this. Please close this issue if you agree.

I've read the section and I think it addresses this by reference to other standards. In particular I like:
st is a basic language range per Matching of Language Tags [rfc4647] section 2.1 and l matches st per the basic filtering scheme defined in [rfc4647] section 3.3.1.

In other words, one is not supposed to use an incomplete stem like en-G~
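
For readers following along, the basic filtering scheme quoted above can be sketched roughly as follows. This is an illustration only, not the code of any ShEx implementation; `basic_filter` is a hypothetical name, and accepting "*" follows the RFC4647 definition of a basic language range rather than the ShEx grammar.

```python
def basic_filter(tag: str, language_range: str) -> bool:
    """Sketch of RFC4647 basic filtering: a range matches a tag if,
    ignoring case, it equals the tag or is a prefix of the tag that
    is immediately followed by '-'. The range '*' matches any tag."""
    rng = language_range.lower()
    if rng == "*":
        return True
    t = tag.lower()
    return t == rng or t.startswith(rng + "-")


if __name__ == "__main__":
    # (tag, range) pairs taken from the discussion in this thread.
    for tag, rng in [("en-GB", "en"), ("en-GB", "en-G"),
                     ("fr-BE", "FR"), ("ru-Cyrl-RU", "ru-RU")]:
        print(tag, rng, basic_filter(tag, rng))
```

Under this scheme an incomplete stem such as en-G matches neither en-GB nor en-GR, and ru-RU does not match ru-Cyrl-RU, since basic filtering only compares prefixes and knows nothing about default scripts.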