json-schema-org / json-schema-spec

The JSON Schema specification

Home Page:http://json-schema.org/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Schema Update: Require dialect to be declared and understood

jdesrosiers opened this issue · comments

Update the spec to require that implementations require that a dialect is declared. Implementations MUST refuse to evaluate schemas without a known dialect.

In order of precedence, a dialect can be determined in the following ways.

  1. The $schema keyword
  2. The schema media type parameter (if the schema was retrieved over HTTP)
  3. The user provides a default through an implementation's API in some way. (optional)

Implementations MUST NOT guess the dialect based on the contents of the schema and MUST NOT use a default not specified by the user. If a dialect is declared, but the implementation doesn't understand it, the implementation MUST refuse to evaluate the schema. It MUST NOT attempt to evaluate the schema using a known dialect.

I think most (many?) implementations may have a problem with point 2. I would guess that most are like mine and don't have any interaction with HTTP, and thus have no access to headers or media types. This means that any reading of such a header would have to be performed by the user and manually set in the implementation (point 3).

We should include text that implementations are not required to be able to retrieve schemas over HTTP or other means that would provide media type parameters.

I think most (many?) implementations may have a problem with point 2.

Point 2 only applies if a schema is retrieved by the implementation over HTTP. If the implementation doesn't retrieve over HTTP at all, point 2 doesn't apply. There's nothing to have a problem with.

This means that any reading of such a header would have to be performed by the user and manually set in the implementation (point 3).

Exactly. If the user fetched the schema manually, it's their responsibility to check the headers if necessary and pass than along through configuration (point 3). However, the intent of this requirement isn't to tell users what to do, it's to tell implementations what to do. It might be more clear to think of the case where a local schema includes a reference to an external schema not in local schema storage. Generally, the implementations refuse to evaluate the schema, but it's allowed (enabled by config) for implementations to fetch that external schema over HTTP. In that case, it should be looking at the media-type parameter as part of determining the dialect. If the implementation chooses to fail and not fetch the schema, they skip the execution branch where media-types matter.

If the implementation doesn't retrieve over HTTP at all, point 2 doesn't apply.

We should be more explicit about what's required of an implementation here. I'm not sure there's anything in the spec implying that an implementation even should retrieve a schema over HTTP. In fact, in regard to reference resolution, we actively discourage it. So I'm not really sure why point 2 would be a consideration for implementations.

I can see point 2 as being guidance for users to know where they can find the dialect to supply to an implementation (point 3), but I don't think implementations should be doing this. I think including this in a prioritized list is misleading for implementors.

Generally, the implementations refuse to evaluate the schema, but it's allowed (enabled by config) for implementations to fetch that external schema over HTTP. In that case, it should be looking at the media-type parameter as part of determining the dialect. If the implementation chooses to fail and not fetch the schema, they skip the execution branch where media-types matter.

Since the default behavior is for the implementation to not fetch the external resource, the user has to configure the implementation to do this. If the implementation provides a method that does this, I see this as the implementation acting on the user's behalf, and so, yes, that logic still needs to check the header.

My approach to this is to provide an interface where a user can define function that fetches a resource. However, I don't provide any built-in function that does this. The user has to write that code. Thus it's still the user's responsibility to check the headers. To me, these are separate.

Because of my setup, it seems to me that if an implementation does provide that function, it's acting as its own user, where "user" has to look at the headers and configure "implementation" appropriately. But "implementation" doesn't need to know where the configuration came from, just that it was configured. As far as "implementation" is concerned, it's only ever steps 1 and 3.

It's a subtle difference, but that's how I think of it.

I think this small edit may make me feel better:

  1. The $schema keyword
  2. The schema media type parameter (if the schema was retrieved over HTTP) (optional)
  3. The user provides a default through an implementation's API in some way. (optional)

It's still prioritized, but step 2 is also optional.

I'm not sure there's anything in the spec implying that an implementation even should retrieve a schema over HTTP. In fact, in regard to reference resolution, we actively discourage it. So I'm not really sure why point 2 would be a consideration for implementations.

The spec includes defines support for sending/receiving schemas and instances over HTTP including two media types and a Link header specification for associating a retrieved instance with a schema. Without this functionality Hyper-Schema isn't possible.

Yes, we do discourage retrieving schemas that weren't explicitly given to the implementation. (That recommendation is problematic for a couple reason, but that's another discussion for another time.) But, that doesn't mean it's not a part of JSON Schema. All the spec say is that implementations shouldn't enable it by default, not that they shouldn't do it. A lot of implementations do support this and a lot of them enable it by default even tho they technically shouldn't.

My approach to this is to provide an interface where a user can define function that fetches a resource.

Sure, you could do it that way and avoid having to deal with media types. Doing it that way that would make it the user's responsibility to look at headers. Again, that means it wouldn't apply to your implementation, but there are plenty of implementations where it would apply.

I think this small edit may make me feel better

Great, we can get really precise when we get the spec wording. However, I want to make sure we're on the same page that it's NOT optional for an implementation to take the media type parameter into consideration, it's just optional for an implementation to fetch a schema in the first place. If and only if the implementation fetches a schema, it MUST consider the media type parameter before configured defaults. If the user fetches the schema, even through a plugin to the implementation, it's not the implementation's responsibility.

The schema media type parameter (if the schema was retrieved over HTTP)

What are the mechanics of how this works? From what I can gather from Wikipedia and RFC 2045, I assume that the Content-Type header will be something like

application/schema+json; schema=https://json-schema.org/draft/next/schema

Is that right?

The spec says that parameter values are generally case sensitive unless specified by the parameter. I imagine we'd need to specify case-insensitive as we'd want implementations to normalize URIs.

Is all of this part of the media type registration?

Is that right?

Yes that's right, except the URI needs to be in quotes for the syntax to be correct. (schema="https://..."). Iirc, the / is not allowed unless the value is quoted.

Is all of this part of the media type registration?

yes

From #1434 (@jdesrosiers)

The fourth rule is an embedded schema inheriting from its parent context. I think it belongs after the media-type parameter and before the user provided default.

An embedded schema isn't going to have a separate media type from its parent schema. It doesn't make sense to consider the parameter at all.

The dialect is first its parent schema's dialect, and then it can be overridden by a local $schema keyword, then by user configuration (if the implementation even supports configuring embedded schema resources, which seems unlikely).

I've re-read that, and I've made the algorithm more complex than it needs to be (although it's saying the same thing).

Put differently,

  1. $schema
  2. Parent dialect
  3. Media type parameter
  4. User configuration

Caveats:

  • If you're at the document root, there is no parent dialect, obviously.
  • The only time you'd get a media type is if you've fetched the resource, which implies you're at the root, so any parameter wouldn't apply to embedded resources.
  • Support for (and desire to use) user configuration of embedded resources still seems unlikely. It's more likely that a user would say, "Process this as 2020-12," and that applies to the entire evaluation.

So really it's

  1. $schema
  2. Parent dialect / Media type parameter (depending on whether you're at the root and what's available)
  3. User configuration

I agree. I didn't mean to imply the media type parameter would have any bearing on embedded schemas. Not all steps apply in all situations. I intended that implementations would skip steps that don't apply in a specific situation. The context dialect and media type parameter only apply is certain mutually exclusive situation, so they could be listed in any order of presence or presented as the same step as you did at the end. The reason I put the media type parameter first was because it's a case where the user is being explicit about what they want where the context dialect is a passive indicator, but ultimately it doesn't matter which comes first because they're mutually exclusive.

Support for (and desire to use) user configuration of embedded resources still seems unlikely. It's more likely that a user would say, "Process this as 2020-12," and that applies to the entire evaluation.

I'm not sure what you mean by that exactly or how it might work, but it doesn't seem like something we need to consider at this point.

For example (and this is really contrived)

{
  "$id": "schema:outer",
  "$defs": {
    "inner": {
      "$id": "schema:inner",
      "type": "integer"
    }
  }
  "type": "object",
  "properties": {
    "foo": { "$ref": "schema:inner" }
  }
}
schema.useDialect("schema:outer", draft7);
schema.useDialect("schema:inner", draft202012);

This seems dumb to me, but there's no reason why an implementation couldn't do it.