opengeospatial / ogcapi-processes

Home Page: https://ogcapi.ogc.org/processes

Add minOccurs and maxOccurs to output description

pvretano opened this issue · comments

Right now, the specification makes the assumption that the correlation between a named output and an output value is 1:1. However, I've run into use cases where it would be convenient to allow the correlation to be 1:N or some other ratio. I'm thinking of a simple use case where you have a process that has a single input with minOccurs=1 and maxOccurs=unbounded and generates a single output ... but you want one output value for each specified input.

Right now the only way I can see this being described is using the schema element with an array but that kinda breaks the "the schema describes a single instance value" rule that applies to inputs.

Adding minOccurs and maxOccurs to the output description, and extending the results path to /jobs/{jobId}/results/{outputID}/[N] where N is the Nth output value that corresponds to the Nth input value, would allow a more symmetrical description of inputs and outputs.
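Under this proposal an output description could carry the same cardinality fields as an input description. A minimal sketch (the output name, title and schema below are made up for illustration, not taken from any existing process):

```json
{
  "outputs": {
    "clippedScene": {
      "title": "One clipped image per input scene",
      "minOccurs": 1,
      "maxOccurs": "unbounded",
      "schema": {
        "type": "string",
        "contentMediaType": "image/tiff; application=geotiff"
      }
    }
  }
}
```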

The same rules that currently apply to fetching a single output or multiple outputs would continue to apply. If you GET /jobs/{jobId}/results/{outputID} you get a results.yaml with all the output values. If you fetch /jobs/{jobId}/results/{outputID}/[N] you get the content-negotiated output as you currently do. The defaults would be set to be backward compatible.

Might be related to #412.

Right now the only way I can see this being described is using the schema element with an array but that kinda breaks the "the schema describes a single instance value" rule that applies to inputs.

If the output definition is a schema defined as an array of 1 or more elements, with a description that says there will be as many elements in that array as occurrences of that input, I don't think this breaks anything -- the schema does define one output: an array of multiple elements making up the single output instance.
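Such an array-valued output could be described along these lines, using the JSON Schema vocabulary already used for inputs (the items schema is hypothetical):

```json
{
  "schema": {
    "type": "array",
    "minItems": 1,
    "description": "One element per occurrence of the corresponding input",
    "items": { "type": "string", "format": "uri" }
  }
}
```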

If you GET /jobs/{jobId}/results/{outputID} you get a results.yaml with all the output values.

Having a results.yaml response at this end-point would introduce so much additional confusion, because results.yaml is used for /jobs/{jobId}/results when negotiating application/json.

Things remain much simpler if it is only used there, because normally GET /jobs/{jobId}/results/{outputID} always returns the actual output. There was already so much confusion about the meaning of minOccurs and maxOccurs for inputs that I think we should avoid introducing it for outputs.

But if one wants to do something like you describe, then one could simply define a JSON schema for the output that's similar to results.yaml with links to individual outputs? (the links could probably even be to /jobs/{jobId}/results/{outputID}/[N] since that doesn't seem to break anything).
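Such an output could then resolve to a document along these lines (host, job id and output id are invented for the sketch; whether the /N links would be legal is exactly the open question):

```json
[
  { "href": "https://example.org/jobs/123/results/myOutput/1" },
  { "href": "https://example.org/jobs/123/results/myOutput/2" }
]
```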

Other potential alternative way to handle such use cases:

  • Separate execution request for each output part (especially since this seems like a use case for things where each occurrence of the input can be processed independently, so there is not really a need to submit the execution as a single request)
  • Part 3 Collection Input & Collection/Dataset Output -- just like plugging input/output jacks and then tuning in to the bits of interest when you need them, accessing them just as if the data was already processed and available on a regular OGC API data service, except the magic of workflows happens behind the scene.

@jerstlouis so much confusion? Really? The basic rule still applies just clarified a little bit. If you are GET'ing a single output VALUE then you get the bare value subject to content negotiation. Otherwise you get results.yaml. Seems pretty simple to me.

Separate execution is not a solution to this problem. The outputs may or may not be correlated to the inputs. I was trying to provide a simple uncorrelated example, but we have a process from a client that takes N inputs and generates M outputs where M != N.

Part 3 seems like an overly complex solution to a simple problem. Besides, these deployed processes are not created by us, so we may have no control over how the outputs are generated or put into an OGC collection, or even whether the creator of the process wants the data to be presented and/or accessed that way.

SWG meeting from 2024-05-27: @pvretano will create a PR for us to check the ramifications.

I think this might cause issues for KVP execution.

A request is allowed to indicate ?input1=x&input1=y and aggregate the values of input1 however needed to map them to the minOccurs/maxOccurs definition of the corresponding "schema that describes a single instance value" rule. Because there is a distinction between cardinality and dimensionality, this mapping can be resolved.
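For instance, a repeated KVP parameter like the following (process identifier hypothetical) would be aggregated into a single input1 value according to that input's minOccurs/maxOccurs definition:

```
GET /processes/echo/execution?input1=x&input1=y
```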

However, a corresponding outputs definition would not work, as they need to be accessible by the specific ID under /jobs/{jobId}/results response. Receiving an array nested under the output-ID mapping with minOccurs/maxOccurs and/or minItems/maxItems for either cardinality or array-dimensionality would bring up the same kind of ambiguity that justified the use of minOccurs/maxOccurs for inputs. Even if the /jobs/{jobId}/results/{outputID}/[N] endpoint was added to resolve this situation, /jobs/{jobId}/results would still need to be disambiguated.

I am curious about the use case of the process that takes N inputs and generates M outputs where M!=N. If N are not linearly related to M, how are the corresponding inputs/outputs logically mapped together?

I tend to agree with @jerstlouis on this. I believe this case might be better addressed with some alternate multi-request handling or an alternate process output definition. I believe that minOccurs/maxOccurs for inputs makes sense for cases where N->1 applies, such as a tile-stitching operation, since the number N of inputs is variable but they are combined into one output. However, given a certain N inputs, there should always be some relationship to figure out the expected amount of outputs. Otherwise, how would the client figure out how many /jobs/{jobId}/results/{outputID}/[N] to look up?

Even if the /jobs/{jobId}/results/{outputID}/[N] endpoint was added to resolve this situation, /jobs/{jobId}/results would still need to be disambiguated.

@fmigneault I don't understand which ambiguity you are referring to ...

GET /jobs/{jobId}/results
If the process has a single output and that output has a single value then you get a 200 back and the response body contains the output value. Otherwise you get a 200 back and the response body conforms to results.yaml.

GET /jobs/{jobID}/results/{outputID}
If the requested output has a single value then you get a 200 back and the response body contains the output value. Otherwise you get a 200 back and the response body conforms to results.yaml.

GET /jobs/{jobID}/results/{outputID}/N
Get a 200 back and the response body contains the output value.
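Concretely, for a hypothetical job 123 whose only output echo holds two values "a" and "b", one plausible shape of the responses under these rules would be (payloads illustrative only):

```
GET /jobs/123/results/echo     ->  { "echo": ["a", "b"] }   (conforms to results.yaml)
GET /jobs/123/results/echo/1   ->  "a"                      (bare, content-negotiated value)
```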

Let's say minOccurs/maxOccurs were allowed for outputs.

If you have a process that has at the same time:

  • output1 with minOccurs=1, maxOccurs=2 single value
  • output2 with minOccurs=2, maxOccurs=3 single value
  • output3 with minOccurs=1, maxOccurs=1 array type
  • output4 with minOccurs=1, maxOccurs=2 array type
  • output5 with minOccurs=2, maxOccurs=3 array type

Arrays in GET /jobs/{jobId}/results would mean different things at different levels according to cardinality/dimension.
Responses from GET /jobs/{jobID}/results/{outputID} would return different array/value combinations.
The GET /jobs/{jobID}/results/{outputID}/N would in some cases produce different results than the non-/N endpoint and in other cases would be identical (or inapplicable).

For a client, interpreting these combinations becomes very complicated quickly.
It is even worse when "single value" and "array type" can themselves be an array, or an array of arrays, etc.
Then, there is the situation where those various outputs need to be combined with the various cardinality/dimension combinations in a workflow.

I find we are not considering the more complicated use cases, and are potentially recreating/increasing the (already complicated) considerations about inputs where these kinds of issues were mentioned time and time again.

@fmigneault if maxOccurs is greater than 1 then the output value is always encoded as an array in the response ... even if there is only a single value. If the output values happen to themselves be arrays and maxOccurs>1 then the response will be an array of arrays. Using your examples ...

  • output1 with minOccurs=1, maxOccurs=2, single value -> [value]
  • output2 with minOccurs=2, maxOccurs=3, single value -> [value1, value2]
  • output3 with minOccurs=1, maxOccurs=1, array type -> [value]
  • output4 with minOccurs=1, maxOccurs=2, array type -> [[v1, v2, ..., vN]]
  • output5 with minOccurs=2, maxOccurs=3, array type -> [[v1, v2, ..., vN], [v1, v2, ..., vN]]
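Putting those five outputs together, a /jobs/{jobId}/results document under this encoding rule might look like the following (values invented):

```json
{
  "output1": ["value"],
  "output2": ["value1", "value2"],
  "output3": ["v1", "v2"],
  "output4": [["v1", "v2"]],
  "output5": [["v1", "v2"], ["v3", "v4"]]
}
```

Note that output3 is the bare array value itself (maxOccurs=1, so no wrapping), while output4's outer array comes from the cardinality encoding, not from the value.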

For outputs 1, 2, 4 and 5 you can use the /N notation to get the Nth value. For output 3 we should probably generate an out-of-range exception if you try something like /jobs/{jobId}/results/{outputID}/[N] because the response is not an array ... the value is an array but the response is not an array. OAProc only returns "complete" values so you can't, for example, request a sub-field of an object; you can only get the entire object.

if maxOccurs is greater than 1 then the output value is always encoded as an array in the response ... even if there is only a single value. If the output values happen to themselves be arrays and maxOccurs>1 then the response will be an array of arrays. Using your examples ...

That makes sense.

What I foresee becoming an issue is maxOccurs=1 combined with schema: {type: array, items: {}}.
The single-output value would be encoded as an array even though maxOccurs=1, and there would be no way to distinguish it from a maxOccurs>1 array of, e.g., schema: {type: int} when looking at the response from /jobs/{jobId}/results or /jobs/{jobId}/results/{outputID}.
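For example, these two hypothetical output definitions would produce byte-identical responses under the proposed encoding:

```json
{
  "outputA": { "minOccurs": 1, "maxOccurs": 1,
               "schema": { "type": "array", "items": { "type": "integer" } } },
  "outputB": { "minOccurs": 3, "maxOccurs": 3,
               "schema": { "type": "integer" } }
}
```

Both would serialize as [1, 2, 3], so a client looking only at the response cannot tell a single array-valued output from three integer values.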

For output 3 we should probably generate an out-of-range exception if you try something like /jobs/{jobId}/results/{outputID}/[N] because the response is not an array

That would be a great way to handle it. It would allow a clear distinction between a multi-output array and a single-output that happens to be an array. Ideally, there would be a way to infer this directly from the JSON responses on /jobs/{jobId}/results and /jobs/{jobId}/results/{outputID} as well to avoid the additional request.