pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.

Home Page: https://pangeo-forge.readthedocs.io/


Streaming pipelines subscribed to source data arriving in a cache bucket

cisaacstern opened this issue · comments

Over in leap-stc/data-management#49 (comment) @alxmrs suggested this as a way to integrate Pangeo Forge and weather-dl. This seems like a great pattern to adopt, and it would be broadly useful for any slow caching operation (any out-of-band caching operation, not exclusively weather-dl). The ECMWF case serves as a motivating (extreme) instance of a general problem for which this could be a very desirable solution.

From his experience with stream windowing in the context of https://github.com/bytewax/bytewax, @rabernat suggested a very elegant idea: timers could be configured so that the timestamps used to label events are not the wall time at which the data arrives in the cache, but rather a key corresponding to the indexed position the cached data represents in the FilePattern. (A less general case would be the timestep the data represents along the concat dim. We could start that way, using a concat-only recipe, but to accommodate n-dimensionality it would probably need to be an n-dimensional key, not unlike the one used by the Rechunk transform's GroupByKey.)
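To make the idea concrete, here is a minimal sketch of labeling a cache-arrival event with the indexed position it represents in the pattern rather than its wall-clock arrival time. Everything here is illustrative: the toy 2-D pattern, `url_for`, and `event_key` are assumptions, not pangeo-forge-recipes API.

```python
# Sketch: invert a toy "FilePattern" so that each source URL maps to an
# n-dimensional index key (one concat dim, one merge dim). A newly cached
# object can then be stamped with this key instead of its arrival time.
from itertools import product

TIMES = ["2020-01-01", "2020-01-02", "2020-01-03"]  # concat dim (time)
VARIABLES = ["tas", "pr"]                           # merge dim (variable)

def url_for(time: str, var: str) -> str:
    # Hypothetical source-URL scheme for the toy pattern.
    return f"https://example.com/{var}/{time}.nc"

# Map each URL back to its (time_index, variable_index) position,
# analogous to the keys used by the Rechunk transform's GroupByKey.
URL_TO_KEY = {
    url_for(t, v): (ti, vi)
    for (ti, t), (vi, v) in product(enumerate(TIMES), enumerate(VARIABLES))
}

def event_key(cached_url: str) -> tuple:
    """Return the indexed position a newly cached file represents."""
    return URL_TO_KEY[cached_url]
```

In a Beam pipeline the same lookup could feed a timestamp/key assignment step, so downstream grouping operates in "pattern space" rather than event time.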

Processing would then be configured to begin once a logically complete set of IndexedPositions was cached, which for the first processing group in the stream would be the same set as would otherwise be generated by pattern.items(). Subsequent triggers could fire for smaller, append-able units.
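The trigger condition itself is simple set arithmetic; a hedged sketch, with `is_complete` and the expected-key set as hypothetical names standing in for whatever stateful trigger a real Beam implementation would use:

```python
# Sketch: fire processing only once every expected indexed position has
# arrived in the cache, i.e. the same set pattern.items() would produce
# for the first group. Subsequent groups would use smaller expected sets.

def is_complete(cached_keys, expected_keys: set) -> bool:
    """True once every expected indexed position has been cached."""
    return expected_keys <= set(cached_keys)

# Expected keys for a toy 3 (time) x 2 (variable) pattern.
EXPECTED = {(t, v) for t in range(3) for v in range(2)}
```

In Beam terms this would likely live in a stateful DoFn (or a custom trigger) that accumulates keys per group and emits the group downstream when the set closes.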

xref #447 for appending and #570 for caching

Quick question for now: why not use Beam's built-in streaming primitives? The KISS solution here, IMO, is to build off of Beam's PubSub and Kafka connectors to react to bucket event updates.
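For the bucket-event side of that suggestion, the per-message handling is mostly parsing the notification payload. A minimal sketch: the `bucket`/`name` field names follow GCS's Pub/Sub notification format (treat that as an assumption to verify), and a Beam pipeline would apply this in a `Map` step after `beam.io.ReadFromPubSub`.

```python
# Sketch: extract the newly cached object's location from a bucket
# notification message, as it might arrive over Pub/Sub. Field names
# ("bucket", "name") mirror GCS object-change notifications.
import json

def object_from_notification(message_bytes: bytes) -> tuple:
    """Return (bucket, object_name) for a newly cached object."""
    payload = json.loads(message_bytes.decode("utf-8"))
    return payload["bucket"], payload["name"]
```

The output of this step is exactly what the indexed-position lookup above needs as input, so the two compose naturally in a streaming pipeline.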

💯 we should do this.

In which case this may be more a matter of documenting best practices rather than adding (much, any?) code here.