cqframework / cql-tests-runner

Test Runner for CQL Tests

Support Library/$evaluate evaluation

brynrhodes opened this issue · comments

Get the runner to the point that it can run Groups of tests as a Library using the Library/$evaluate operation

Initial focus here should be on:

  1. Building a Library with ELM (i.e., an ELM JSON Library)
  2. Invoking Library/$evaluate with a libraryResource parameter that inlines the Library (rather than having to POST it first), as sketched below
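
For item 2, here is a minimal sketch of what an inline invocation might look like from the runner, assuming a Node environment with global fetch. The libraryResource parameter name is taken from this issue; the base URL, library name, and result handling are placeholders, since the actual $evaluate parameters are still up for discussion.

```javascript
// Sketch only: parameter names (especially libraryResource) follow the wording of this
// issue; the final $evaluate signature may differ.
const fhirBaseUrl = 'https://example.org/fhir'; // placeholder engine endpoint

async function evaluateGroupInline(libraryName, elmJson) {
  const parameters = {
    resourceType: 'Parameters',
    parameter: [
      {
        name: 'libraryResource', // inline the Library instead of POSTing it first
        resource: {
          resourceType: 'Library',
          name: libraryName, // e.g. the test group name
          status: 'active',
          content: [
            {
              contentType: 'application/elm+json',
              data: Buffer.from(JSON.stringify(elmJson)).toString('base64')
            }
          ]
        }
      }
    ]
  };

  const response = await fetch(`${fhirBaseUrl}/Library/$evaluate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/fhir+json' },
    body: JSON.stringify(parameters)
  });
  return response.json(); // expected: a Parameters resource with one part per expression
}
```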

@angelok1 @brynrhodes This PR and Angelo's PR are similar. I think for the sake of consistency we should stick to JavaScript.

For Library/$evaluate to work, a FHIR Library resource would need to be created (or, alternatively, just the CQL extracted), and the participant would then ingest it into their server sometime before the test run. Building these library files during the test run is too late in the workflow for some engines. For the near future, the Firely CQL dotnet SDK team needs to manually convert CQL to C# to a FHIR Library outside of the FHIR server and load it separately. I think Angelo may have a similar requirement, converting CQL to a database language, but not on the fly.

To automate this, please allow me to spitball a direction that is totally up for discussion:

  1. Create an API endpoint on the test-runner that outputs the current tests in various formats (see the client sketch after this list). The benefit here is that the test-runner composes the test output, rather than having every engine do it and possibly be inconsistent.

    • GET /api/tests?_type
      • output would be a zip payload of the _type.

      • _type would be one of the following:

        • null/empty - a zip payload of tests as they originally exist
        • cql - the CQL libraries grouped and extracted from the test XML
        • elm - the ELM version of the CQL libraries grouped and extracted from the test XML (optional parameter _format=json or xml)
        • fhir - FHIR library resources of the same, grouped and extracted from the test XML (optional parameter _format=json or xml)
  2. A FHIR server, engine, or separate simple API query software would regularly query this endpoint (ideally only upon github cql-tests commits) and then process the tests.

    • For the non-FHIR engines, or those that do not want to expose a public endpoint, they could pull in the tests, perform any conversion, run them, and provide the results back, as we discussed in the DQIC. This would be another endpoint, /api/submitresults, where the engine can post its results to update the UI on demand.

    • For FHIR servers with a public endpoint, they would query /api/tests upon commit, convert and load the FHIR resources into their server, and then wait for the scheduled run of Library/$evaluate, where the test-runner sends the request to run each group by passing the library name. For scheduled runs, a couple of hours after the test commit is probably sufficient if the engine has automated the process of updating its FHIR libraries upon commit. There could also be a schedule, or a manual action in the UI, for re-running a particular engine.

  3. Once this tool is published to the web, we'll have to add some data about the target engines, both for the results UI and to know what the endpoint is (if any), and maybe some other data about which tests are intentionally skipped/unsupported and why, etc. Quick top-level data example:

Engine  | Endpoint                                  | Method
Firely  | https://server.fire.ly                    | $evaluate
Smile   | https://cql-tests.smiledigitalhealth.com  | $cql
Carrera | n/a                                       | submitresults  <-- i.e., skip automated test
Google  | n/a                                       | submitresults  <-- i.e., skip automated test
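
Purely to make the proposal concrete, here is a rough sketch of how an engine-side job might consume these endpoints. Everything here is hypothetical: the routes and _type/_format parameters come from the proposal above, and the submitresults payload shape is just an illustrative guess.

```javascript
// Hypothetical client for the proposed endpoints; routes, parameters, and payload
// shapes illustrate the idea above and are not an implemented API.
import { writeFile } from 'node:fs/promises';

const runnerBaseUrl = 'https://cql-tests-runner.example.org'; // placeholder host

// Pull the tests as FHIR Library resources (zip payload), e.g. triggered by a cql-tests commit.
async function pullTests() {
  const res = await fetch(`${runnerBaseUrl}/api/tests?_type=fhir&_format=json`);
  await writeFile('cql-tests-fhir.zip', Buffer.from(await res.arrayBuffer()));
  // ...unzip, convert/load into the engine, run the groups...
}

// For engines that skip the automated run, post results back to update the UI on demand.
async function submitResults(results) {
  await fetch(`${runnerBaseUrl}/api/submitresults`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      engine: 'ExampleEngine', // hypothetical engine identifier
      results                  // e.g. [{ test: 'CqlAggregateFunctionsTest.Sum', pass: true }]
    })
  });
}
```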

I realize the scope is greater than this ticket, but I figured a discussion is in order to avoid going down a path that is not flexible for everyone.

I see your point that the underlying approach is JavaScript, so it would make sense to have a unified approach. Here is my thinking as to why I felt Python was the right fit:

@richfirely One of the issues I am taking away from what you're saying is that different systems will need different preparation steps. Some will need to compile and others will need to transform/load data. To some extent, our system would need to do that, too (e.g., load test data). I'm sure many (most?) systems will ultimately do that.

That presents a problem because we either need to bake in hooks everywhere, or somehow free the toolchain from that concern.

Many benchmarks address this by offering a toolkit rather than a test runner framework. The toolkit generates all of the inputs for your tests in a standard format. The implementer is responsible for preparing, executing, and collecting the output. There are rules about what is allowable (for example, you can't alter the substance of the inputs, but you can rename them, compile them, break them up, combine them, and do whatever else is practical).

For our tests, an engine would also need the test XML, which defines the expected results. That could be another _type in the model above, but we would need another configuration artifact that tells the end user what to collect per test (which tests to ask for, which XML configurations, which CQL, etc.). We may need that anyway.
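
As a strawman only, that kind of configuration artifact might look something like the following. The group name, field names, and paths are entirely illustrative; nothing like this exists in the repo today.

```javascript
// Strawman manifest entry; nothing like this exists yet in cql-tests or cql-tests-runner.
const manifestEntry = {
  group: 'CqlAggregateFunctionsTest',                          // illustrative group name
  collect: {
    xml: 'tests/cql/CqlAggregateFunctionsTest.xml',            // expected results (source of truth)
    cql: 'output/cql/CqlAggregateFunctionsTest.cql',           // extracted CQL library
    elm: 'output/elm/CqlAggregateFunctionsTest.json',          // translated ELM (JSON)
    fhir: 'output/fhir/Library-CqlAggregateFunctionsTest.json' // FHIR Library resource
  },
  skipped: [] // tests intentionally skipped/unsupported by this engine, with reasons
};
```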

In my mind, the script itself ensures the test content is consistent, since it is generated from the XML source of truth. With a REST endpoint doing this job, you would need to clone the repo to get the test configuration, then call a server you would have to stand up just to serve back the same content you had just cloned.

The reason I lean toward Python for this script is that the minimum required of any implementer would be to clone the repo and run the script (which gives you your artifacts); at that point you're done with your inputs and the repo has served its purpose. Python is ubiquitously available, with nothing else to install or stand up. If an implementer would like more automation, they could stand up the test runner. But, in my opinion, the test runner will not be of much utility to most implementations, since they will all have a custom setup for whatever their platform does that we can't possibly foresee and bake into the approach.

This is my 2 cents.

I created my initial comment with the thinking that this was eventually moving toward a centralized test server rather than a tool everyone would download into their pipeline. That was clarified for me today in the DQIC meeting, and it changes the discussion quite a bit.