NVIDIA / earth2mip

Earth-2 Model Intercomparison Project (MIP) is a python framework that enables climate researchers and scientists to inter-compare AI models for weather and climate.

Home Page: https://nvidia.github.io/earth2mip/


Earth-2 MIP Data flow

NickGeneva opened this issue

Discussion about Data flow

Opening this issue to have a forum about looking at the dataflow inside Earth-2 MIP.

Context: To better support users whose use cases are software development, consider how to future-proof for different grids / structures, etc.

The Current

Presently, data is exchanged between parts of earth2mip via numpy/pytorch tensors. This data is physical, thus some metadata is needed to describe what exactly the data represents. We currently do this through a set of properties assigned to each of these components, which return either Python primitives or some object (grid schema).

This means that to communicate these required properties we need package-wide concepts such as geo operator / timestep / timestepper. Additionally, we need global schemas for these coordinate systems in some cases (although we have been slowly moving away from these).
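For concreteness, a simplified sketch of this property-based style (illustrative only, not the exact earth2mip interfaces):

```python
import datetime

import torch

class TimeStepper:
    """Sketch of a component whose tensor I/O is described by properties."""

    @property
    def grid(self):
        """Grid schema object describing the lat/lon layout."""
        ...

    @property
    def in_channel_names(self) -> list[str]:
        ...

    @property
    def time_step(self) -> datetime.timedelta:
        ...

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        """Advance the state one step; the tensor's meaning lives in the
        properties above, so callers must consult them to interpret x."""
        ...
```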

For the most part, within the bounds of this package, this has worked. Granted, the natural coupling between the package-wide interfaces has made the addition of new models a little painful and sometimes more challenging to debug, but documentation can likely fix that.

The Issue

Generally I see two issues with this:

  • Presently it's more difficult to use components of earth2mip in isolation. Without pure functions, we require users / developers to always have clear knowledge of the properties present / needed. You end up in cycles of property updates for different workflows.
  • Updates to the schemas / property requirements have rolling effects across the package... if one model has additional / unique needs, the coupling between components can make an update challenging.

Outside of that, I've been thinking about the general question of how people can better understand how information moves in the package. The data alone can reveal a lot for users, but there needs to be metadata with it that is presently hard to get to, which can lead (and has led) to challenging debugging.

Proposal

What I would like to propose is switching, in some sense, to a more pure-function approach where the data array and metadata are always passed together. That way you know that when messaging between components, sufficient information is always provided.

Off GPU:

  • numpy array + OrderedDict[str, ndarray]

On GPU:

  • tensor + OrderedDict[str, ndarray]

Note: I propose the metadata be an ordered dict. The use of a built-in Python primitive is intentional here, as this enables component outputs to be used without coupling to earth2mip. It's ordered because this allows the dims of the data to be specified.
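A minimal sketch of what this could look like (the channel names and the select_channel helper are illustrative, not an existing earth2mip API):

```python
from collections import OrderedDict

import numpy as np
import torch

# "On GPU" case: a [time, channel, lat, lon] tensor plus its coordinates.
device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.randn(1, 2, 721, 1440, device=device)
coords = OrderedDict(
    [
        ("time", np.array([np.datetime64("2023-01-01")])),
        ("channel", np.array(["t2m", "u10m"])),
        ("lat", np.linspace(90, -90, 721)),
        ("lon", np.linspace(0, 360, 1440, endpoint=False)),
    ]
)

# A component then becomes a pure function: (data, coords) in, (data, coords) out.
def select_channel(x, coords, channel):
    dim = list(coords).index("channel")
    (idx,) = np.where(coords["channel"] == channel)[0]
    out = x.index_select(dim, torch.tensor([int(idx)], device=x.device))
    out_coords = coords.copy()
    out_coords["channel"] = coords["channel"][int(idx) : int(idx) + 1]
    return out, out_coords

u10m, u10m_coords = select_channel(data, coords, "u10m")  # shape [1, 1, 721, 1440]
```

Note that nothing in select_channel depends on earth2mip: any package that produces an array plus an ordered dict of coordinates could call it.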

I see this as having a number of advantages, with a few disadvantages that I think are worth the sacrifice.

Advantages

  • Reduces coupling between components; components do not need to rely on each other for information. All info is provided on call. This simplifies object instantiation and increases modularity.
  • Eliminates the need for global interfaces (geo operator, timestep, timestepper) since these properties are no longer needed. We can focus on component-level interfaces instead.
  • Reduces the complexity of each component's interface, making contribution easier.
  • Flexible to other coordinate systems.
  • Enables components to be easily used in other development packages.
  • Likely allows better consistency between component APIs (it doesn't matter if you just need channels and time; you ingest all coordinate data and parse it).
  • Still performant. Reliance on tensors / numpy arrays for the data array means we are still fast. A performance hit from the coordinate logic is unlikely.

Disadvantages

  • Each component needs to check the coordinate input to make sure it's valid. We can provide utils for this (see the sketch after this list), but it's additional work. That being said, this enables consistent, informative errors for debugging.
  • The coordinates in the OrderedDict can't really be type-checked. They're not static and will need to be checked at run time.
  • This is a big change for the package.
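As an example of such a util, a hypothetical check_coords (the name and signature are mine, not an existing earth2mip function) could produce those consistent errors:

```python
from collections import OrderedDict

def check_coords(x, coords: OrderedDict, required: list[str]) -> None:
    """Hypothetical validation util: raise informative errors at run time."""
    # The dict must describe every axis of the array, in order.
    if x.ndim != len(coords):
        raise ValueError(
            f"Array has {x.ndim} dims but coords describe {len(coords)}: {list(coords)}"
        )
    for axis, (name, values) in enumerate(coords.items()):
        if x.shape[axis] != len(values):
            raise ValueError(
                f"Dim '{name}' has {len(values)} coordinate values "
                f"but axis {axis} has size {x.shape[axis]}"
            )
    # The component declares which dims it needs.
    missing = [d for d in required if d not in coords]
    if missing:
        raise ValueError(f"Missing required dims: {missing}")
```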

Development

Implementing this change would be a large effort... I would propose doing it component by component, slowly integrating into existing workflows. The complexity of the workflows will increase during this time because adapters will need to be implemented, but if we focus on the component level, the workflows will clean up greatly after the fact.
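To illustrate the adapter idea, a legacy property-based component could be wrapped roughly like this (all names hypothetical):

```python
import numpy as np

def adapt_legacy(component):
    """Wrap a property-based component as a pure (data, coords) function.

    `component` is assumed to expose in_channel_names / out_channel_names
    properties and a tensor-in / tensor-out __call__ (hypothetical interface).
    """
    def wrapped(x, coords):
        # Legacy components carry their own metadata; verify the incoming
        # coords agree before delegating to the tensor-only interface.
        assert list(coords["channel"]) == list(component.in_channel_names)
        y = component(x)
        out_coords = coords.copy()
        out_coords["channel"] = np.asarray(component.out_channel_names)
        return y, out_coords

    return wrapped
```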

Points to Discuss

tl;dr: the points I'm wondering about:

  1. Move from the present property + array data flow to a pure metadata + array flow.
  2. If 1, should the metadata object be the Python primitive OrderedDict?

Thanks for opening this issue @NickGeneva. It will take me some time to parse. Starting with "The Current":

> The Current
>
> Presently, data is exchanged between parts of earth2mip via numpy/pytorch tensors. This data is physical, thus some metadata is needed to describe what exactly the data represents. We currently do this through a set of properties assigned to each of these components, which return either Python primitives or some object (grid schema).

Perhaps it would be helpful to give some examples of these existing objects.

> This means that to communicate these required properties we need package-wide concepts such as geo operator / timestep / timestepper. Additionally, we need global schemas for these coordinate systems in some cases (although we have been slowly moving away from these).

Let's define "schema".

The earth2mip.grid.LatLonGrid objects no longer hardcode 721x1440 etc. the way earth2mip.schema did. They are more flexible and encompass any lat/lon grid. Do we have any concrete use cases that do not use lat/lon grids? If so, we can add an earth2mip.grid.Unstructured.
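For the sake of discussion, a sketch of what such a class might look like (the proposed, not-yet-existing earth2mip.grid.Unstructured), following the flexible LatLonGrid pattern:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class Unstructured:
    """Sketch of the proposed earth2mip.grid.Unstructured (does not exist yet).

    One (lat, lon) pair per grid point, rather than separate lat/lon axes.
    """

    lat: list[float]
    lon: list[float]

    @property
    def shape(self) -> tuple[int, ...]:
        return (len(self.lat),)
```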

> For the most part, within the bounds of this package, this has worked. Granted, the natural coupling between the package-wide interfaces has made the addition of new models a little painful and sometimes more challenging to debug, but documentation can likely fix that.

Which code objects are causing the trouble?

Can you provide specific examples of trouble adding new models? I feel a lot of the trouble came from before the more recent batch of APIs (DataSource, TimeLoop) were formalized, when we still used enum objects in earth2mip.schema for the grid and channels. Another difficulty was using earth2mip.networks.Inference for all models, but we no longer do that.

The Issue

> Generally I see two issues with this:
>
> • Presently it's more difficult to use components of earth2mip in isolation. Without pure functions, we require users / developers to always have clear knowledge of the properties present / needed. You end up in cycles of property updates for different workflows.

Let's unpack "pure function" more. To me this means a function without state. Turning such a function into a class that has some properties for metadata does not make the overall use less pure.

> • Updates to the schemas / property requirements have rolling effects across the package... if one model has additional / unique needs, the coupling between components can make an update challenging.

What is the rolling effect specifically in the linked PR? While motivated by graphcast, the updates in that PR fixed bugs in the scoring that could appear with other models (e.g. assuming input_channels == output_channels). Just because graphcast is the first example of such a model that we have added does not mean this is a model-specific issue.

> Outside of that, I've been thinking about the general question of how people can better understand how information moves in the package. The data alone can reveal a lot for users, but there needs to be metadata with it that is presently hard to get to, which can lead (and has led) to challenging debugging.

I think the main debate here is static vs. dynamic structure. I feel users may benefit from a dynamic interface, but the more static structure (e.g. grid, channels, etc. live on the object and can be checked ahead of time) is essential for writing pipelines that work at scale.
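For example, with static metadata on the objects, a pipeline can fail fast before any data is fetched (hypothetical names):

```python
def validate_pipeline(time_loop, data_source):
    """Hypothetical ahead-of-time check enabled by static metadata."""
    # With grid/channels as properties on the objects, mismatches surface
    # before the run starts, not deep inside a large ensemble job.
    assert set(time_loop.in_channel_names) <= set(data_source.channel_names)
    assert time_loop.grid == data_source.grid
```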

Decision: this would be too invasive, and the current data flow works. So let's keep it.

Thanks for the input!