google / weather-tools

Tools to make weather data accessible and useful.

Home Page: https://weather-tools.readthedocs.io/

`target_path` templates should be totally compatible with Python format strings

alxmrs opened this issue · comments

target_path (and target_filename) should be totally compliant with Python's standard string formatting. This includes being able to use named arguments (e.g. 'gs://bucket/{year}/{month}/{day}') as well as specifying format specs for values (e.g. 'gs://bucket/{year:04d}/{month:02d}/{day:02d}').
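As a quick illustration of the behaviour being requested, here is a minimal sketch that uses only Python's built-in str.format. The bucket name and the values are made up for the example.

```python
# Named placeholders, filled from keyword arguments.
named = "gs://bucket/{year}/{month}/{day}".format(year=2017, month=1, day=2)

# The same placeholders with format specs: zero-padded, fixed-width integers.
padded = "gs://bucket/{year:04d}/{month:02d}/{day:02d}".format(
    year=2017, month=1, day=2
)

print(named)   # gs://bucket/2017/1/2
print(padded)  # gs://bucket/2017/01/02
```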

Right now, we have a two-path templating system for our config language. The default way to template files is to use {} in the target_path as placeholders for the values specified in the partition_keys section. See this config, for example. In addition, we have a system to create files in a date-based directory hierarchy via a boolean flag. The process for doing this is documented here.

There are a few problems with the date-name hierarchy approach. Mainly, it confuses users and introduces more places for the code to go wrong. For example, #127 is an issue that has cropped up due to this more complex implementation. Instead, it would be preferable if the values referenced by the partition_keys could be used in the target_path via Python's format string syntax. Then, users could format file patterns in totally arbitrary ways without us having to individually support corner cases. Beyond date hierarchies, users currently cannot pad an integer value to a fixed number of digits. For example, if I had a config like:

[parameters]
target_path=gs://my-bucket/{}-data.nc
partition_keys=
   days
[selection]
days=1/2/3/4

I have no way of expressing paths like gs://my-bucket/01-data.nc, gs://my-bucket/02-data.nc, etc. These require a template like gs://my-bucket/{days:02d}-data.nc.

To support this, somewhere, we basically need to run:

target_template.format(*parameter_keys.values(), **parameter_keys)
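To make that call concrete, here is a small sketch under a few assumptions: the helper name format_target is hypothetical, the values are already typed (ints rather than strings), and the keys are kept in an ordered dictionary. Passing the values both positionally and by keyword lets bare {} placeholders and named placeholders work against the same data.

```python
from collections import OrderedDict


def format_target(template: str, key_values: OrderedDict) -> str:
    """Hypothetical helper: fill a template positionally and by name."""
    # Positional values serve bare `{}` placeholders; keyword values serve
    # named placeholders such as `{days}`. Unused arguments are ignored.
    return template.format(*key_values.values(), **key_values)


# Assumed example values; insertion order defines the positional order.
key_values = OrderedDict([("days", 1), ("pressure_level", 500)])

positional = format_target("gs://my-bucket/{:02d}-{}-data.nc", key_values)
by_name = format_target(
    "gs://my-bucket/{days:02d}-{pressure_level}-data.nc", key_values
)

print(positional)
print(by_name)
```

Both templates resolve to the same path here, which is the property that would let existing {}-style configs keep working alongside named ones.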

Today, this is approximately done here:

def prepare_target_name(config: Config) -> str:

In fact, a naive implementation of this issue would involve:

  • Deleting the append_date_dir code (or raising an error if a user tries to use it)
  • Structuring partition_key_values as an ordered dictionary
  • Calling format
  • Updating the process_config function to encourage correct usage of the parser
  • Documenting the usage everywhere

A problem that you would run into is that pretty much all of the values in partition_key_values are strings! You can't format a string like you would an int, so formatting options like {days:02d}.nc would not work. A prerequisite ticket is therefore required: #5.
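The failure mode is easy to reproduce. In the sketch below, the coerce function is a hypothetical stand-in for whatever typed parsing the prerequisite ticket ends up providing; it is not taken from the codebase.

```python
def coerce(value: str):
    """Hypothetical conversion: return an int when the string parses as one."""
    try:
        return int(value)
    except ValueError:
        return value


template = "gs://my-bucket/{days:02d}-data.nc"

try:
    template.format(days="1")  # the value is still a string
    failed = False
except ValueError:
    failed = True  # the 'd' format code is not valid for str objects

# Once the value is an int, the same template formats as intended.
fixed = template.format(days=coerce("1"))
print(failed, fixed)
```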

Note: Ideally, a side effect of implementing this change (and its siblings) is that the following method:

def process_config(file: t.IO) -> Config:

does not fundamentally alter the data in the config. Specifically, the code around this block:

if use_date_as_directory(config):

should no longer be needed, and can be removed.

There's a tricky case here to handle: partition_keys that have a date. These can't be parsed like integers, and it's hard to specify all of the fields in the target template (e.g., I want to give day and month two digits, and year four digits).

I think the best / simplest solution here would be to parse the date fields as Python datetime.date objects. Then, we can encourage config writers to use Python's strftime-style date format codes (which the call to format supports natively for date objects) to incorporate date information into the config target path.

See this SO post: https://stackoverflow.com/a/22842734
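As a brief sketch of what that looks like in practice: datetime.date accepts strftime directives as its format spec, so they can be placed directly after the colon in a placeholder. The bucket name and date below are made up.

```python
import datetime

day = datetime.date(2017, 1, 2)

# Positional placeholder with strftime directives as the format spec.
path = "gs://bucket/{:%Y/%m/%d}-data.nc".format(day)

# The same idea with a named placeholder.
named = "gs://bucket/{date:%Y/%m/%d}-data.nc".format(date=day)

print(path)   # gs://bucket/2017/01/02-data.nc
print(named)  # gs://bucket/2017/01/02-data.nc
```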

For example, after the change, the MARS config string with append_date_dir could look like:

[parameters]
client=cds
dataset=reanalysis-era5-pressure-levels
# This config creates a date-based directory hierarchy.
# In this case, the two files that will be created are
# gs://ecmwf-output-test/era5/2017/01/01-pressure-500.nc
# gs://ecmwf-output-test/era5/2017/01/02-pressure-500.nc
# gs://ecmwf-output-test/era5/2017/01/01-pressure-1000.nc
# gs://ecmwf-output-test/era5/2017/01/02-pressure-1000.nc
target_filename=
target_path=gs://ecmwf-output-test/era5/{:%Y/%m/%d}-pressure-{}.nc
partition_keys=
     date
     pressure_level
[selection]
product_type=reanalysis
format=netcdf
variable=
    divergence
    fraction_of_cloud_cover
    geopotential
pressure_level=
    500
    1000
date=2017-01-01/to/2017-01-02
time=
    00:00
    06:00
    12:00
    18:00
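To check the target_path above against the four files listed in the config comments, here is a rough sketch. It assumes the date range and pressure levels have already been parsed into typed values (that parsing is not shown), and it makes no claim about the order in which the tool itself would produce the files.

```python
import datetime
import itertools

target_path = "gs://ecmwf-output-test/era5/{:%Y/%m/%d}-pressure-{}.nc"

# Assumed, already-parsed partition values for `date` and `pressure_level`.
dates = [datetime.date(2017, 1, 1), datetime.date(2017, 1, 2)]
pressure_levels = [500, 1000]

# One path per combination of partition values, in partition_keys order.
paths = [
    target_path.format(date, level)
    for date, level in itertools.product(dates, pressure_levels)
]

for path in paths:
    print(path)
```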