Open-EO / openeo-r-client

R client package for working with openEO backends

Home Page: https://open-eo.github.io/openeo-r-client

Local Prototyping UDF (Debugging)

przell opened this issue · comments

Title Local Prototyping UDF (Debugging)
Date 2021-11-18
Issue #94
Category Debugging
Description openEO UDFs allow the user to run arbitrary R code within an openEO process graph. To debug, parametrize and validate the function that is sent to a backend, the user needs the possibility to test the function locally. Ideally the user can retrieve a subset of the data with the same dimensionality as the data that arrives in the UDF service for local prototyping.
Dependencies openEO API definition
Links Local Backend for testing (#88)
Priority High
Impact High

With the new approach to UDFs (bridge to Python), an idea:
Process Graph... run_udf(, debug = TRUE) returns a stars object as .Rdata, which needs to be saved in user_workspace or returned via a synchronous call.

As discussed internally, the retrieval of sample data is crucial for local prototyping. Therefore we need a function that allows the user to retrieve that data.

Here we have several implementation choices and face some problems:

  1. configurability: the user defines the process graph or we simply give some options for properties
  2. running the job: sync vs. async
  3. size: how large can the sample data get
  4. result retrieval: in general not a problem, but what happens if there are auxiliary files that ship metadata
  5. results interpretation: not a problem for a single time instance image, but how is time propagated correctly and coherently among back-ends when downloading a serialized raster time series, and maybe also band - all that information should resolve into a stars object with which the user can "play around"
  6. data format: different back-ends will most definitely offer different file formats, which will structure relevant dimensional metadata differently

The result interpretation bit might also be relevant for #39 and the immediate creation of a stars object. Unless there is a convenient and well-defined way of doing this, this will cause problems, because every back-end provides the data differently, which results in having different data representations in R which do not properly reflect the data structure in the back-end. @m-mohr For now results must be described as STAC elements. But for serializing raster time series or images with multiple bands, there is no recommended way of describing it, right?

At this point we are not able to get the exact data that is injected into the UDF, because

  • there is no user filesystem at the moment (2022-02-18)
  • each back-end chunks the data differently to achieve best performance

As an intermediate solution we can retrieve sample data before a UDF is run, and the user can then experiment with the returned data in a convenient way (probably as a stars object, which will also be used inside the R-UDF).
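One way to keep such a sample small (point 3 in the list above) is to shrink the spatial extent around its centre before the request is sent. The following is only a minimal base R sketch under stated assumptions: the bounding box is assumed to be a named list with `west`, `south`, `east` and `north`, and `shrink_bbox` is a hypothetical helper for illustration, not a function of the openeo package.

```r
# Hypothetical helper (not part of the openeo package): shrink a bounding
# box around its centre so each side keeps only `fraction` of its length.
shrink_bbox <- function(bbox, fraction = 0.1) {
  stopifnot(fraction > 0, fraction <= 1)
  cx <- (bbox$west + bbox$east) / 2       # centre longitude
  cy <- (bbox$south + bbox$north) / 2     # centre latitude
  half_w <- (bbox$east - bbox$west) * fraction / 2
  half_h <- (bbox$north - bbox$south) * fraction / 2
  list(west = cx - half_w, south = cy - half_h,
       east = cx + half_w, north = cy + half_h)
}

# Assumed example extent; the sample keeps 10 % of each side length
aoi <- list(west = 10, south = 46, east = 12, north = 47)
sample_aoi <- shrink_bbox(aoi, fraction = 0.1)
str(sample_aoi)
```

Shrinking around the centre keeps the sample inside the original area of interest, so the dimensionality (bands, time steps) stays the same while the number of pixels drops.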

Maybe as an addition to the list above:
7. depending on the complexity of the user's UDF, processing can be very slow, in particular when the function has to be called for each element instead of using vectorized functions
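The difference between per-element and vectorized processing can be tried locally on a small array. This is a rough base R sketch, assuming a toy x * y * band array stands in for the sample data (a real sample would probably be a stars object) and using an NDVI-style band ratio purely as an illustrative UDF body.

```r
# Toy cube standing in for sample data: 50 x 50 pixels, 2 bands (red, nir)
set.seed(1)
cube <- array(runif(50 * 50 * 2, min = 0.01, max = 1), dim = c(50, 50, 2))

# Per-element version: the function is called once for every pixel
ndvi_pixel <- function(px) (px[2] - px[1]) / (px[2] + px[1])
ndvi_loop <- apply(cube, c(1, 2), ndvi_pixel)

# Vectorized version: one arithmetic expression over whole band slices
red <- cube[, , 1]
nir <- cube[, , 2]
ndvi_vec <- (nir - red) / (nir + red)

# Both approaches give the same 50 x 50 result
stopifnot(isTRUE(all.equal(ndvi_loop, ndvi_vec)))
```

Wrapping each variant in `system.time()` on a larger array gives a quick impression of how much the per-pixel function calls cost before the UDF is sent to a back-end.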

A first version is now available in the develop branch. You can now get a sample with get_sample(). A vignette with some examples will follow.