google / weather-tools

Tools to make weather data accessible and useful.

Home Page: https://weather-tools.readthedocs.io/

Convert `Config` dict into a typed Dataclass

alxmrs opened this issue · comments

Config has outgrown its dict-of-dicts representation. We should convert it to a dataclass and update the parser to output a typed dataclass. This will prevent easy-to-make typos and make the code cleaner.

Originally posted by @fredzyda in #117 (comment)

xref: #5

To expand on the details of this issue...

Config = t.Dict[str, t.Dict[str, Values]]

Config should be a dataclass. Each field of the dictionary (as outlined and enforced in the parser, see below) should be typed and documented accordingly. It's fine if some key-value pairs are dictionaries or of any type (e.g., the whole selection section), but we should type things when we can.

def process_config(file: t.IO) -> Config:

I think if we complete #5 by using protobuf, we might get this one for free. I think #5 makes sense as the first thing to work on, then maybe we should reevaluate this one.

As discussed today in standup and in #5 (comment), it does in fact make sense for us to maintain a parser (and not move to protobufs) for the downloader's config language.

Given that, I'd like to offer pointers to how a config should be structured. This will prove key to creating the dataclass-based abstraction for configs in this issue.

A Config is made of two sections of key-value pairs. The Parameters section, documented here, has a finite set of required and optional fields, each with a specific type. In addition, the Parameters section should accept arbitrary key-value pairs, and these should be passed all the way through to each client (basically, as keyword arguments).

An additional complexity to these generic key-value pairs of the parameters section: Parameters can take named dictionaries of arbitrary key-value pairs in the form of parameter subsections (documented here).

Next, the Selection section is a core component of a Config. Selections are key-value pairs that map strings to metadata that determine what should be ingested from an archive. The types for the values in the selection section need to be much more inclusive. For the MARS client, for example, ECMWF provides a MARS Syntax. The config language, especially the selection section, is modeled after this syntax. These support basic primitive types (strings, ints, floats, times, dates) as well as lists of items (items are delimited by /). They also provide some syntax for expressing ranges of values. Our ConfigParser respects these conventions via our own implementation (maybe it could have a better method name):

def parse_mars_syntax(block: str) -> t.List[str]:
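
As a rough illustration of the list and range conventions described above, here is a simplified sketch. It is not the project's implementation: the name `expand_mars_values` is hypothetical, and it only handles `/`-delimited lists and integer ranges of the form `start/to/end` or `start/to/end/by/step`. Dates, times, and floats are left out.

```python
import typing as t


def expand_mars_values(block: str) -> t.List[str]:
    """Expand a MARS-style value string into a list of string values.

    Hypothetical helper for illustration only; integer ranges are the
    only range type it understands.
    """
    tokens = [tok.strip() for tok in block.split('/') if tok.strip()]
    lowered = [tok.lower() for tok in tokens]

    # Range form: start/to/end or start/to/end/by/step.
    if len(tokens) in (3, 5) and lowered[1] == 'to':
        try:
            start, end = int(tokens[0]), int(tokens[2])
            step = 1
            if len(tokens) == 5:
                if lowered[3] != 'by':
                    return tokens  # Not a range after all; treat as a list.
                step = int(tokens[4])
        except ValueError:
            return tokens  # Non-integer bounds; treat as a plain list.
        if step <= 0:
            raise ValueError(f'Range step must be positive, got {step}.')
        # MARS-style ranges include the end value.
        return [str(value) for value in range(start, end + 1, step)]

    # Plain list (or a single value).
    return tokens
```

A single value comes back as a one-element list, so callers can treat every selection value uniformly.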

At minimum, the selection section should map strings to values of any type that the MARS syntax allows. This is mostly implemented already via the type alias Values:

Values = t.Union[t.List['Values'], t.Dict[str, 'Values'], bool, int, float, str] # pytype: disable=not-supported-yet


The above is a good general background of what's intended in this config language. This should prove useful context for this and similar issues. Now, I'd like to talk about specific implementation details to close this bug.

To close this bug in isolation of all others, the Config type alias needs to be replaced by a dataclass where all the common properties (in the parameters section) are represented as fields of the dataclass, while allowing all possible fields to be represented within the dataclass (e.g. arbitrary selection sections).

Another perspective on this issue: any time you see a line of code that gets a value from a dictionary by name, especially if it also has to type-cast that value after getting it, that value could be made into a field of the dataclass.

For example:

target_path = t.cast(str, parameters.get('target_path', ''))
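
One possible shape for such a dataclass is sketched below. This is an assumption-laden illustration rather than the design that was eventually merged: the flat layout, the `kwargs` catch-all field, and the `from_dict` constructor are all hypothetical, and only a few of the common parameters are shown.

```python
import dataclasses
import typing as t

Values = t.Union[t.List['Values'], t.Dict[str, 'Values'], bool, int, float, str]


@dataclasses.dataclass
class Config:
    """Hypothetical typed replacement for the dict-of-dicts Config."""
    # Common parameters become typed fields with defaults.
    client: str = ''
    dataset: str = ''
    target_path: str = ''
    partition_keys: t.List[str] = dataclasses.field(default_factory=list)
    # Arbitrary extra parameters, forwarded to clients as keyword arguments.
    kwargs: t.Dict[str, Values] = dataclasses.field(default_factory=dict)
    # The selection section stays loosely typed.
    selection: t.Dict[str, Values] = dataclasses.field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: t.Dict[str, t.Dict[str, Values]]) -> 'Config':
        """Build a Config from the legacy {'parameters': ..., 'selection': ...} dict."""
        known = {f.name for f in dataclasses.fields(cls)} - {'kwargs', 'selection'}
        params = raw.get('parameters', {})
        typed = {k: v for k, v in params.items() if k in known}
        extra = {k: v for k, v in params.items() if k not in known}
        return cls(**typed, kwargs=extra, selection=dict(raw.get('selection', {})))


raw = {
    'parameters': {'client': 'cds', 'target_path': 'gs://bucket/out.nc', 'api_key': 'abc'},
    'selection': {'year': ['2020']},
}
config = Config.from_dict(raw)
```

With this layout, the cast above reduces to `config.target_path`, and a misspelled field name fails type checking instead of silently falling back to a default.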

commented

We have identified the following two approaches for the structure of the dataclass. @alxmrs, please check and share your feedback.

Approach 1

Create separate dataclasses for Config, Parameters, SubSections & SubSectionFields.
For all subsections defined in the config file, a SubSections object (containing SubSectionFields) will be created and stored in a list against the subsection field of the Parameters object.
Config-Dataclass will be created using Parameters object & selection field.

Values = t.Union[t.List['Values'], t.Dict[str, 'Values'], bool, int, float, str]  # pytype: disable=not-supported-yet

@dataclass(init = False)
class SubSectionFields:
    api_key: str
    api_url: str

@dataclass(init = False)
class SubSections:
    name: str
    value: SubSectionFields

@dataclass(init = False)
class Parameters:
    client: str
    dataset: str
    target_path: str
    target_filename: str
    partition_keys: t.List[str] 
    subsections: t.List[SubSections]
    api_key: str
    api_url: str
    __subsection__: str
    num_api_keys: int
    force_download: bool
    user_id: str

@dataclass()
class Config:
    parameters: Parameters
    selection: t.Dict[str, Values]

Usage

config = Config(Parameters(),{})
config.parameters.client = "cds"
config.parameters.dataset = "ecmwf-mars-output"
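
One thing to keep in mind with `init = False`: no `__init__` is generated, so `Parameters()` creates an object with none of its fields set, and reading a field before assigning it raises `AttributeError`. The sketch below contrasts that with giving each field a default. The class names here are hypothetical and trimmed to a couple of fields.

```python
import dataclasses
import typing as t


@dataclasses.dataclass(init=False)
class NoInitParameters:
    # No __init__ is generated, so these fields start out unset.
    client: str
    dataset: str


@dataclasses.dataclass
class DefaultedParameters:
    # Every field has a default, so a bare constructor call is fully populated.
    client: str = ''
    dataset: str = ''
    partition_keys: t.List[str] = dataclasses.field(default_factory=list)


no_init = NoInitParameters()
no_init.client = 'cds'
try:
    no_init.dataset  # Never assigned, and the class defines no default.
    dataset_is_set = True
except AttributeError:
    dataset_is_set = False

defaulted = DefaultedParameters(client='cds')
```

Defaults also keep the generated `__repr__` and `__eq__` usable, since both read every field.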

Approach 2

A new dataclass will be created using dataclasses.make_dataclass.
For this new dataclass, the base will be Parameters data-class & fields will be one or more subsections.
For more details on make_dataclass, please refer to Documentation Link and SO Link.

Values = t.Union[t.List['Values'], t.Dict[str, 'Values'], bool, int, float, str]  # pytype: disable=not-supported-yet

@dataclass(init = False)
class SubSectionFields:
    api_key: str
    api_url: str

@dataclass(init = False)
class Parameters:
    client: str
    dataset: str
    target_path: str
    target_filename: str
    partition_keys: t.List[str] 
    api_key: str
    api_url: str
    __subsection__: str
    num_api_keys: int
    force_download: bool
    user_id: str

@dataclass
class Config:
    parameters: Parameters
    selection: t.Dict[str, Values]

Usage

config = Config(Parameters(),{})
config.parameters.__class__ = make_dataclass('X', fields=[('deepmind', SubSectionFields),('cloud', SubSectionFields)], bases=(Parameters,))
config.parameters.client='cds'
config.parameters.deepmind=SubSectionFields()
config.parameters.deepmind.api_key="uuuu1"
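
A variation on this approach, sketched under the assumption that the subsection names are known before the parameters object is created, is to build the dynamic class first and instantiate it directly instead of reassigning `__class__`. The helper `make_parameters_class` and the trimmed-down field lists are hypothetical.

```python
import dataclasses
import typing as t


@dataclasses.dataclass
class SubSectionFields:
    api_key: str = ''
    api_url: str = ''


@dataclasses.dataclass
class Parameters:
    client: str = ''
    dataset: str = ''


def make_parameters_class(subsection_names: t.Iterable[str]) -> type:
    """Create a Parameters subclass with one SubSectionFields field per name."""
    fields = [
        (name, SubSectionFields, dataclasses.field(default_factory=SubSectionFields))
        for name in subsection_names
    ]
    return dataclasses.make_dataclass(
        'ParametersWithSubsections', fields, bases=(Parameters,))


ParamsCls = make_parameters_class(['deepmind', 'cloud'])
params = ParamsCls(client='cds')
params.deepmind.api_key = 'uuuu1'
```

Because every field carries a default, the generated `__init__` accepts any subset of fields as keyword arguments. Static type checkers cannot see the dynamically created fields, though, which is a trade-off compared with Approach 1.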

Fixed in #142.