SSB Timeseries

Background

Statistics Norway is the national statistics agency in Norway. We are building a new production system in the cloud and moving towards a modern architecture based on open source technologies.

Time series play a key role in the statistics production process.

Our mission comes with strict requirements for transparency and data quality. Some are mandated by law, others stem from commitment to international standards.

The data itself varies widely, but time resolution and publishing frequencies are typically low. While volumes are sometimes significant, they are far from extreme. Quality and reliability are far more important than latency, which shifts the focus towards process and data control.

This library came out of a PoC to demonstrate how key functionality could be provided in alignment with architecture decisions and process model requirements.

At the core is storage with performant read and write, search and filtering.
Good descriptive metadata is key to findability.
A wide selection of math and statistics libraries is key for calculations and models.
Visualisation tools play a role both in ad hoc and routine inspection and quality control.
Workflow integration with automation and process monitoring helps keep quality consistent.
Data lineage and process metadata are essential for quality control.

It is constructed as an abstraction between the storage and automation layers and the statistics production code, providing a way forward while postponing some technical choices.

How to get started?

Install by way of poetry add ssb_timeseries.
The library should work out of the box with default settings. Note that the defaults are for local testing, i.e. not suitable for a production setting.
To apply custom settings: The environment variable TIMESERIES_CONFIG should point to a JSON file with configurations.
The command poetry run timeseries-config <...> can be run from a terminal in order to switch between defaults.
Run poetry run timeseries-config home to create the environment variable and a file with default configurations in the home directory, i.e. /home/jovyan in the SSB Jupyter environment (or the equivalent when running elsewhere).
The similar poetry run timeseries-config gcs will put configurations and logs in the home directory and time series data in a shared GCS bucket gs://ssb-prod-dapla-felles-data-delt/poc-tidsserier. Take appropriate steps to make sure you have access. The library does not attempt to facilitate that.
With the environment variable set and the configuration in place you should be all set. See the reference https://statisticsnorway.github.io/ssb-timeseries/reference.html
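
As a minimal sketch of the environment variable mechanism described in the steps above, the following Python snippet writes a small JSON file and points TIMESERIES_CONFIG at it. The JSON keys (timeseries_root, log_file) and the helper load_config_from_env are assumptions made for illustration. The library has its own configuration handling, and the actual fields are described in the reference.

```python
import json
import os
import tempfile
from pathlib import Path


def load_config_from_env(var_name: str = "TIMESERIES_CONFIG") -> dict:
    """Read the JSON file that the environment variable points to.

    Hypothetical helper for illustration; not part of the library.
    """
    config_path = os.environ.get(var_name)
    if not config_path:
        raise RuntimeError(f"{var_name} is not set")
    with open(config_path, encoding="utf-8") as f:
        return json.load(f)


# Write an example configuration to a temporary directory.
# The keys below are assumed for illustration only.
workdir = Path(tempfile.mkdtemp())
example = {
    "timeseries_root": str(workdir / "series"),
    "log_file": str(workdir / "timeseries.log"),
}
config_file = workdir / "timeseries_config.json"
config_file.write_text(json.dumps(example, indent=2), encoding="utf-8")

# Point the environment variable at the file and read it back.
os.environ["TIMESERIES_CONFIG"] = str(config_file)
config = load_config_from_env()
print(config["timeseries_root"])
```

Setting the variable from Python only affects the current process. To make it persist across sessions, use the timeseries-config command or your shell profile.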

Note that while the library is in a workable state and should work both locally and (for SSB users) in JupyterLab, it is still in early development. There is a risk that fundamental choices are reversed and breaking changes introduced.

With that disclaimer, feel free to explore and experiment, and do not be shy about asking questions or giving feedback.

Functionality overview

The core of the library is the Dataset class. This is essentially a wrapper around a DataFrame (for now Pandas, later probably Polars) in the .data attribute.

The .data attribute should comply with the conventions implied by the underlying information model. These start out as pure conventions and are subject to evaluation. At a later stage they are likely to be enforced by Parquet schemas. Failing to obey them will cause some methods to fail.

The Dataset.io attribute connects the dataset to a helper class that takes care of reading and writing data. This structure abstracts away the IO mechanics, so that the user does not need to know about the "physical" details, only the information model meaning of the choices made.
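
The separation between a dataset and its IO helper can be sketched roughly as follows. This is not the library's implementation: SeriesIO, JsonDirectoryIO and SketchDataset are hypothetical names, and JSON files stand in for whatever storage format the library actually uses. The point is only that the dataset object delegates persistence to an interchangeable helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Protocol


class SeriesIO(Protocol):
    """Hypothetical interface for the storage helper."""

    def read(self, name: str) -> dict: ...

    def write(self, name: str, data: dict) -> None: ...


class JsonDirectoryIO:
    """Stores each set as one JSON file in a directory (illustration only)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def read(self, name: str) -> dict:
        path = self.root / f"{name}.json"
        if not path.exists():
            return {}
        return json.loads(path.read_text(encoding="utf-8"))

    def write(self, name: str, data: dict) -> None:
        path = self.root / f"{name}.json"
        path.write_text(json.dumps(data), encoding="utf-8")


class SketchDataset:
    """Holds data and delegates persistence to its io helper."""

    def __init__(self, name: str, io: SeriesIO) -> None:
        self.name = name
        self.io = io
        self.data = io.read(name)

    def save(self) -> None:
        self.io.write(self.name, self.data)


io = JsonDirectoryIO(Path(tempfile.mkdtemp()))
first = SketchDataset("example_set", io)
first.data = {"2023-01-01": {"series_a": 1.0}}
first.save()

# A second instance with the same name sees the stored data.
second = SketchDataset("example_set", io)
print(second.data)
```

Swapping JsonDirectoryIO for a helper that writes to a bucket would not change the code that uses SketchDataset, which is the kind of decoupling the paragraph above describes.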

Read and write for both versioned and unversioned data types.
Search for sets by name, regex and (planned for later) metadata.
Basic filtering of sets (selecting series within a selected set).
Basic linear algebra: Datasets can be added, subtracted, multiplied and divided with each other and dataframes, matrices, vectors (untested) and scalars according to normal rules.
Basic plotting: Dataset.plot() as shorthand for Dataset.data.plot().
Basic time aggregation: Dataset.groupby(<frequency>, 'sum'|'mean'|'auto')
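
The operations listed above resemble what pandas offers on a plain DataFrame. The sketch below uses pandas directly, not the Dataset class, to illustrate the kind of arithmetic, filtering and time aggregation involved. The column names and values are made up, and the actual Dataset methods may behave differently.

```python
import pandas as pd

# Monthly example data with a datetime index; series are value columns.
index = pd.date_range("2023-01-01", periods=6, freq="MS")
data = pd.DataFrame(
    {
        "series_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "series_b": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    },
    index=index,
)

# Arithmetic with a scalar and with another frame of the same shape.
scaled = data * 2
total = data + scaled

# Filtering: select series whose names match a regular expression.
only_a = data.filter(regex="_a$")

# Time aggregation from monthly to quarterly, by sum and by mean.
quarterly_sum = data.resample("QS").sum()
quarterly_mean = data.resample("QS").mean()

print(quarterly_sum)
```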

The information model

TLDR

Types are defined by:
Versioning defines how updated versions of the truth are represented: NONE overwrites a single version, NAMED or AS_OF maintains new "logical" versions identified by name or date.
Temporality describes the "real world" valid_at or valid_from - valid_to datetime of the data. It will translate into columns, datetime or period indexes of Dataset.data.
Value type (only scalars for now) of Dataset.data "cells".
Datasets can consist of multiple series. (Later: possible extension with sets of sets.)
All series in a set must be of the same type.
Series are value columns in Dataset.data, with rows identified by date(s) or an index corresponding to the temporality.
The combination <Dataset.name>.<Series.name> will serve as a globally unique series identifier.
<Dataset.name> identifies a "directory", hence must be unique. (Caveat: Directories per type create room for error.)
<Series.name> (.data column name) must be unique within the set.
Series names should be related to (preferably constructed from) codes or metadata in such a way that they can be mapped to "tags" via a format mask (and if needed a translation table).
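
The naming rules above can be illustrated with a short sketch. The helper functions, the underscore separator and the example names are assumptions made for illustration and are not part of the library. A positional list of tag names stands in for the format mask.

```python
from collections import Counter


def series_id(dataset_name: str, series_name: str) -> str:
    """Build the identifier <Dataset.name>.<Series.name>."""
    return f"{dataset_name}.{series_name}"


def check_unique(series_names: list[str]) -> None:
    """Raise if a series name occurs more than once within a set."""
    counts = Counter(series_names)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate series names: {duplicates}")


def tags_from_name(series_name: str, mask: list[str], sep: str = "_") -> dict[str, str]:
    """Map a series name to tags using a simple positional format mask."""
    parts = series_name.split(sep)
    if len(parts) != len(mask):
        raise ValueError(f"{series_name!r} does not match mask {mask}")
    return dict(zip(mask, parts))


# Made-up example: names constructed from codes for measure, sex and region.
names = ["pop_f_0301", "pop_m_0301"]
check_unique(names)

mask = ["measure", "sex", "region"]
for name in names:
    print(series_id("population", name), tags_from_name(name, mask))
```

A translation table, where needed, could be a plain dictionary applied to the tag values after the mask has split the name.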

Yes, that was the short version. The long version is still pending production.

Internal documentation:

API-documentation

The documentation is published on GitHub Pages. See the API reference for API-documentation.

Contributing

Contributions are very welcome.

For SSB internals, assuming you have Python working with a standard SSB setup for git and poetry etc, the following should get you going:

# Get the PoC package and move into the cloned directory:
git clone https://github.com/statisticsnorway/arkitektur-poc-tidsserier.git
cd arkitektur-poc-tidsserier

# Run inside a poetry controlled venv:
poetry shell
# Create the default config:
poetry run timeseries-config home
# Run the tests to check that everything is OK:
poetry run pytest
# A couple of the test cases are expected to fail when running for the first time in a new location.
# They should create the structures they need and succeed in subsequent runs.

See the Contributor Guide to learn more.

License

Distributed under the terms of the MIT license, SSB Timeseries is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

Credits

This project was generated from Statistics Norway's SSB PyPI Template.
