google / weather-tools

A helpful step towards fixing #21.

ticket updated: 2022-03-24

As a weather tools user, I would like to be able to preview the effect of each tool before incurring the cost of data movement and infrastructure. These lightweight previews will help me test pipelines before deployment, lowering the number of iterations needed to set up a data pipeline. For this issue, I want to be able to perform dry runs with the weather mover.

Acceptance Criteria

Provide a common interface for exercising dry runs for every Data Sink
When a user passes the -d or --dry-run flag to the weather-mv CLI, this feature will be activated.
When a user checks the tool documentation (the README or CLI help message), they will have a good understanding of what the feature does.
As a user, I will still have some way to monitor the execution flow of the tool during a dry run
- Log messages from non-dry runs will remain the same as those within a dry-run
- As a user, I can inspect the kinds of messages that would have been written to BigQuery.
- (optional) Maybe more log messages are needed to see what's happening?
As a user, I can execute dry runs locally or remotely on Dataflow
Where appropriate, data is simulated in memory. No data is written to disk or cloud storage during a dry run.
- As a user, I would still like to validate execution on actual user-supplied URIs.
Where there are contradictions in requirements, the ergonomic option for weather-mv users is preferred.
All code should be completely covered by tests.
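
The flag criterion above could be wired up roughly as follows. This is a minimal sketch using Python's standard argparse module. Only the -d / --dry-run flag names come from this issue; the build_parser name, the parser layout, and the help text are assumptions for illustration, not weather-mv's actual CLI code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the -d/--dry-run flag comes from this issue."""
    parser = argparse.ArgumentParser(prog='weather-mv')
    parser.add_argument(
        '-d', '--dry-run',
        action='store_true',
        default=False,
        # Assumed help text, based on the "no data is written" criterion.
        help='Preview the pipeline without writing data to any sink.',
    )
    return parser


# argparse maps --dry-run to the attribute name dry_run.
args = build_parser().parse_args(['--dry-run'])
print(args.dry_run)
```

Using store_true keeps the flag off by default, so existing invocations would behave as before unless the user opts in.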

Implementation Notes

The best place to provide a common interface for dry runs is:

weather-tools/weather_mv/loader_pipeline/sinks.py, line 33 in fd0c5e4

class ToDataSink(abc.ABC, beam.PTransform):

Most code changes will happen in this class:

weather-tools/weather_mv/loader_pipeline/bq.py, line 41 in fd0c5e4

class ToBigQuery(ToDataSink):
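
One way to picture a common dry-run interface is a flag on the base sink that gates only the final write step. The sketch below is a plain-Python stand-in, not the real ToDataSink or ToBigQuery: it deliberately avoids Apache Beam so that it runs on its own, and the names SketchSink, InMemorySink, apply, and _write are all hypothetical. It logs the same message in both modes, in line with the criterion that log output should not differ between dry and non-dry runs.

```python
import abc
import dataclasses
import logging
import typing as t

logger = logging.getLogger(__name__)

Row = t.Dict[str, t.Any]


@dataclasses.dataclass
class SketchSink(abc.ABC):
    """Hypothetical stand-in for a data sink; not a beam.PTransform."""
    dry_run: bool = False

    @abc.abstractmethod
    def _write(self, rows: t.List[Row]) -> None:
        """Persist rows to the sink; skipped entirely during a dry run."""

    def apply(self, rows: t.Iterable[Row]) -> t.List[Row]:
        rows = list(rows)
        for row in rows:
            # Same log message in both modes, so users can inspect
            # what would have been written.
            logger.info('Row to write: %s', row)
        if not self.dry_run:
            self._write(rows)
        return rows


@dataclasses.dataclass
class InMemorySink(SketchSink):
    """Hypothetical concrete sink that records writes in a list."""
    written: t.List[Row] = dataclasses.field(default_factory=list)

    def _write(self, rows: t.List[Row]) -> None:
        self.written.extend(rows)


dry = InMemorySink(dry_run=True)
dry.apply([{'temperature': 280.5}])
print(dry.written)  # nothing is persisted during a dry run
```

Keeping the flag on the base class means each concrete sink only implements the write step, while the dry-run check lives in one place. In the real pipeline, the equivalent logic would presumably live in the transform's expand method, but that is an assumption.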

Add support for dry-runs to `weather-mv`.
