manas96 / anaximander

The rapid application development framework for data-intensive Python

Anaximander

Anaximander aims to evolve into a full-fledged rapid application development framework for data-intensive backends. In its initial release, currently under development, it will take the form of a data modeling tool that facilitates storage and integration across multiple database technologies. Modern data applications are polyglot, in the sense that they use a plurality of storage engines: say, a relational database for modeling entities like people, places and things, a document database for nested specifications, a columnar database for time series, and a cloud data warehouse for analytics and report generation.

Anaximander allows Python developers to declare all application data models as object classes, using unifying semantics. Hence it generalizes the concept of the Object-Relational Mapper to NoSQL databases, and enables programmers to work with richly featured objects that embed the application's domain modeling. Further, the object-oriented paradigm automatically extends to data collections, either in the form of the built-in list, set and dict primitives, or in the form of vectorized Series and DataFrames from the pandas library. For instance, if an application programmer declares a TemperatureProbe entity model and a TemperatureSample record model, then typing nx.Set[TemperatureProbe] or nx.Table[TemperatureSample] returns concrete object classes that have been programmatically generated by the framework's metaclasses and that carry metadata and attributes derived from the model class declarations. Among other advantages, this provides automated data validation, intuitive data exploration, and clean, expressive programming.
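
The idea of subscripting an archetype to obtain a cached, concrete class can be illustrated with a minimal sketch in plain Python. This is not Anaximander's implementation: ArchetypeMeta, its cache, and the stand-in Table and TemperatureSample classes below are hypothetical names that only show the general metaclass technique.

```python
class ArchetypeMeta(type):
    """Hypothetical metaclass: Archetype[Model] builds and caches a subclass."""

    _cache = {}  # (archetype, model) -> generated class

    def __getitem__(cls, model):
        key = (cls, model)
        if key not in ArchetypeMeta._cache:
            name = f"{cls.__name__}[{model.__name__}]"
            # Build a real subclass that carries metadata read from the model.
            attrs = {"model": model, "fields": tuple(model.__annotations__)}
            ArchetypeMeta._cache[key] = ArchetypeMeta(name, (cls,), attrs)
        return ArchetypeMeta._cache[key]


class Table(metaclass=ArchetypeMeta):
    """Stand-in archetype: holds rows and checks them against its fields."""

    model = None
    fields = ()

    def __init__(self, rows):
        rows = list(rows)
        for row in rows:
            missing = [name for name in self.fields if name not in row]
            if missing:
                raise ValueError(f"row is missing fields: {missing}")
        self.rows = rows


class TemperatureSample:
    """Stand-in model: only the annotations matter for this sketch."""

    machine_id: int
    timestamp: str
    temperature: float


TempTable = Table[TemperatureSample]
log = TempTable([{"machine_id": 0, "timestamp": "2022-02-18 12:00", "temperature": 45.0}])
```

Because the generated class is cached, Table[TemperatureSample] evaluates to the same class object every time, so it can be used in isinstance checks as well as for instantiation.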

At the highest level, the framework considers three primary data types: entities, records and specs.

  • Entities represent discrete, identifiable application constructs such as physical assets, products, people, places, etc.
  • Records are purely informational objects that represent observations, such as sensor readings, sales reports, or alert notifications.
  • Specs are arbitrarily nested key-value mappings that capture specifications or configurations with a flexible schema, be it detailed entity attributes, schedules or model parameters.

Generally speaking, entities are relational and fit best in a relational database or an object store. Records are indexed by key and time, making them equivalent to event messages, and fit well in columnar databases or data lakes. In data warehouse terminology, entities and records are analogous to dimensions and facts. Finally, specs are document-like objects that may be stored in a document database, a data lake, or embedded in a JSON-typed column as part of an entity table.

For a basic usage demonstration, consider an industrial IoT application in which temperature probes are deployed on manufacturing machines. The probes send regular messages containing temperature readings, which are logged and summarized at preset intervals. In addition, a stream processor analyzes the values in near-real time to emit overheat alerts, and any such occurrence eventually gets logged as an overheat session, i.e. a time interval during which overheating conditions were met.

The following code is a partial demonstration, using the current implementation state. The most notable gaps separating it from the target release are as follows:

  • Model relations are still lacking, particularly the ability to link records to entities.
  • Tabular data indexing is still limited.
  • Most crucially, there is no I/O yet. The first step will be to link models to Arrow datasets so that tabular data can be imported from and exported to the Parquet format. JSON and CSV formats will also be available. Once this is done, the framework will be integrated with database engines.
  • Tabular data validation works but with a sub-optimal implementation that converts dataframes to individual records and back. This will be fixed by integrating the pandera library.

Installation is straightforward:

pip install anaximander

Here are model declarations:

from datetime import datetime
from typing import Optional

import anaximander as nx
from anaximander.operators import Sessionizer
import numpy as np
import pandas as pd

# =========================================================================== #
#                              Model declarations                             #
# =========================================================================== #

# Entities are identifiable things
class Machine(nx.Entity):
    id: int = nx.id()
    machine_type: str = nx.data()
    machine_floor: Optional[str] = nx.data()

# Measurements feature units that get printed
# The metadata is carried into model schemas that use this data type and
# can be used by plotting libraries, or for unit conversions.
# Also note the validation input (greater than or equal to -273), using
# Pydantic's notation.
class Temperature(nx.Measurement):
    unit = "Celsius"
    ge = -273

# Samples are timestamped records expected to show up at a somewhat set frequency,
# though not necessarily strictly so. In other words, the freq metadata is
# used as a time characteristic in summarization operations, but missing or
# irregular samples are tolerated.
# Note that the 'machine_id' field will eventually be replaced by a relational
# 'machine' field of type Machine. This functionality is still pending.
# Also note that the temperature field defines its own validation parameters,
# supplemental to those already defined in the Temperature class (not easy!)
class TemperatureSample(nx.Sample):
    machine_id: int = nx.key()
    timestamp: datetime = nx.timestamp(freq="5T")
    temperature: Temperature = nx.data(ge=0, le=200)

# Unlike samples, Journals are strictly periodic -by construction, since they
# are intended as regular summaries, and hence feature a period field, whose
# type is a pandas Period.
class TemperatureJournal(nx.Journal):
    machine_id: int = nx.key()
    period: pd.Period = nx.period(freq="1H")
    avg_temp: Temperature = nx.data()
    min_temp: Temperature = nx.data()
    max_temp: Temperature = nx.data()

# Spec models are intended as general-purpose nested documents, for storing
# specifications, configuration, etc. They have no identifier because they
# always have an 'owner' -typically an Entity or Record object. This bit is
# not implemented yet. If it was, the Machine model would carry an operating
# spec as a data attribute.
# Here the spec defines the nominal operating temperature range, and will
# be used to compute overheat sessions.
class MachineOperatingSpec(nx.Spec):
    min_temp: Temperature = nx.data()
    max_temp: Temperature = nx.data()

# Sessions are timestamped records with two entries: a start time and an end
# time. These are ubiquitous in natural data processing, particularly for
# aggregating events, such as overheat events in this case.
class OverheatSession(nx.Session):
    machine_id: int = nx.key()
    start_time: datetime = nx.start_time()
    end_time: datetime = nx.end_time()

# This is a very rudimentary implementation of a parametric operator that will
# be used to compute overheat sessions. Note that the Sessionizer class reads
# metadata from the TemperatureSample class, such as the names of the key
# and timestamp field, as well as the timestamp frequency.
sessionizer = Sessionizer(TemperatureSample, OverheatSession, feature="temperature")

# In Anaximander, types are composable, and automatically assembled without
# the need for a class declaration. Here we explicitly name a class for
# tables of temperature samples. Unlike the use of generics in type annotations,
# Table[TemperatureSample] is an actual class -which is cached once it is
# created.
# Table is a so-called archetype, and Table[TemperatureSample] is a concrete
# subtype. A table instance wraps a dataframe along with metadata inherited
# from its class -primarily the model's schema, which is used to conform and
# validate the data.
TempTable = nx.Table[TemperatureSample]
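
The declarations above stack two layers of validation: Temperature sets ge=-273, and the temperature field of TemperatureSample adds ge=0 and le=200. One way to picture how such layers could combine is sketched below in plain Python. The Bounds class and the rule of keeping the tighter bound on each side are assumptions made for illustration, not a description of how Anaximander or Pydantic resolve them internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bounds:
    """Hypothetical helper holding inclusive lower (ge) and upper (le) bounds."""

    ge: Optional[float] = None
    le: Optional[float] = None

    def combine(self, other: "Bounds") -> "Bounds":
        """Return bounds satisfied only by values that pass both inputs."""
        lows = [b for b in (self.ge, other.ge) if b is not None]
        highs = [b for b in (self.le, other.le) if b is not None]
        return Bounds(
            ge=max(lows) if lows else None,
            le=min(highs) if highs else None,
        )

    def validate(self, value: float) -> float:
        """Return the value unchanged, or raise ValueError if out of bounds."""
        if self.ge is not None and value < self.ge:
            raise ValueError(f"{value} is below the lower bound {self.ge}")
        if self.le is not None and value > self.le:
            raise ValueError(f"{value} is above the upper bound {self.le}")
        return value


type_bounds = Bounds(ge=-273)        # mirrors the Temperature declaration
field_bounds = Bounds(ge=0, le=200)  # mirrors TemperatureSample.temperature
effective = type_bounds.combine(field_bounds)
```

Under this assumption, a reading of 250 passes the type-level check but fails the field-level one, which matches the behavior shown for TemperatureSample in the evaluations.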

And some data inputs:

# =========================================================================== #
#                                 Data inputs                                 #
# =========================================================================== #

# Machine instance and operating spec
m0 = Machine(id=0, machine_type="motor")
m0_spec = MachineOperatingSpec(min_temp=40.0, max_temp=55.0)

# Building a table of temperature samples
times = pd.date_range(start="2022-2-18 12:00", freq="5T", periods=12)
temperatures = [45.0, 46.0, 45.0, 50.0, 59.0, 50.0, 48.0, 51.0, 52.0, 56.0, 58.0, 53.0]
sample_log = TempTable(dict(machine_id=0, timestamp=times, temperature=temperatures))

# Computing stats over an hour to fill a Journal instance. Note that eventually
# this kind of summarization will be specified in model declarations and
# carried out by operators -which will be at least partially automated, see
# the sessionizer for a prototype.
avg_temp = round(np.mean(temperatures), 0)
min_temp = min(temperatures)
max_temp = max(temperatures)
hourly = TemperatureJournal(
    machine_id=0,
    period="2022-2-18 12:00",
    avg_temp=avg_temp,
    min_temp=min_temp,
    max_temp=max_temp,
)

# Computing overheat sessions
# Note that the key and threshold will not be necessary once relations are
# established between models -the sample log will be able to point to a machine
# as part of its metadata, and the machine will own its operating spec,
# providing a path to the threshold.
overheat_sessions = sessionizer(sample_log, key=0, threshold=m0_spec.max_temp.data)
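
The hourly statistics above are computed by hand from a single hour of data. A summarization operator would need to bucket samples by period first. The following sketch shows that step using only the standard library; summarize_hourly is a hypothetical helper, not part of Anaximander, and it represents samples as plain (timestamp, value) pairs.

```python
from datetime import datetime, timedelta
from statistics import mean


def summarize_hourly(samples):
    """Group (timestamp, value) pairs by hour and compute avg/min/max."""
    buckets = {}
    for ts, value in samples:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets.setdefault(hour, []).append(value)
    return {
        hour: {
            "avg_temp": round(mean(values), 0),
            "min_temp": min(values),
            "max_temp": max(values),
        }
        for hour, values in sorted(buckets.items())
    }


start = datetime(2022, 2, 18, 12, 0)
temperatures = [45.0, 46.0, 45.0, 50.0, 59.0, 50.0, 48.0, 51.0, 52.0, 56.0, 58.0, 53.0]
samples = [(start + timedelta(minutes=5 * i), t) for i, t in enumerate(temperatures)]
journal = summarize_hourly(samples)
```

With the twelve samples used in this demonstration, the result holds a single entry for the 12:00 hour whose values agree with the hand-computed journal.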

And some evaluations:

# TempTable exposes a data frame that is automatically indexed by key
# and timestamp
>>> assert isinstance(sample_log.data, pd.DataFrame)
>>> print(sample_log)
                                temperature
machine_id timestamp                       
0          2022-02-18 12:00:00         45.0
           2022-02-18 12:05:00         46.0
           2022-02-18 12:10:00         45.0
           2022-02-18 12:15:00         50.0
           2022-02-18 12:20:00         59.0
           2022-02-18 12:25:00         50.0
           2022-02-18 12:30:00         48.0
           2022-02-18 12:35:00         51.0
           2022-02-18 12:40:00         52.0
           2022-02-18 12:45:00         56.0
           2022-02-18 12:50:00         58.0
           2022-02-18 12:55:00         53.0
# TempTable can be broken down into individual records of type TemperatureSample.
# These feature temperature attributes with unit metadata.
# Likewise, the temperature attribute of the TempTable is a series of
# temperatures, and individual data points are measurements with metadata.
>>> r0 = next(sample_log.records())
>>> assert isinstance(r0, TemperatureSample)
>>> print(r0.temperature)
45.0 Celsius
>>> assert next(sample_log.temperature.values()) == r0.temperature
# Temperature defines a lower bound, which is used to validate inputs
>>> try:
...     Temperature(-300)
... except ValueError as e:
...     print(e)
Could not validate -300.0 as a <nxtype:Temperature> instance
# TemperatureSample defines its own bounds as well.
# This would also work if one tried to directly instantiate a table of
# temperature samples from a dataframe, though in the current implementation
# the framework converts the data frame to records and uses Pydantic, which
# is obviously very inefficient. The target is to use the pandera library,
# with no change in the interface.
>>> try:
...     TemperatureSample(machine_id=m0.id, timestamp=times[0], temperature=250)
... except ValueError as e:
...     print(e)
Could not validate machine_id                       0
timestamp      2022-02-18 12:00:00
temperature                  250.0
dtype: object as a <nxtype:TemperatureSample> instance
# Here is a printout of our Journal instance
>>> print(hourly)
machine_id                   0
period        2022-02-18 12:00
avg_temp                  51.0
min_temp                  45.0
max_temp                  59.0
dtype: object
# And finally a printout of our overheating sessions' timespans, a pandas
# IntervalIndex
>>> print(overheat_sessions.timespans)
IntervalIndex([[2022-02-18 12:17:30, 2022-02-18 12:22:30), [2022-02-18 12:42:30, 2022-02-18 12:52:30)], dtype='interval[datetime64[ns], left]', name='timespan')
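
The timespans printed above are consistent with treating each sample as a window of one sampling period centered on its timestamp, then merging adjacent above-threshold windows. The sketch below reproduces that reading in plain Python. It is an interpretation of the output, not the Sessionizer source: the sessionize function, the strict "greater than" comparison, and the centered-window rule are all assumptions.

```python
from datetime import datetime, timedelta


def sessionize(samples, threshold, freq):
    """Merge above-threshold samples into half-open [start, end) intervals.

    Each sample is assumed to cover a window of width `freq` centered on
    its timestamp; windows that touch are merged into one session.
    """
    half = freq / 2
    sessions = []
    for ts, value in sorted(samples):
        if value <= threshold:
            continue
        window_start, window_end = ts - half, ts + half
        if sessions and sessions[-1][1] == window_start:
            # This window touches the previous session, so extend it.
            sessions[-1] = (sessions[-1][0], window_end)
        else:
            sessions.append((window_start, window_end))
    return sessions


start = datetime(2022, 2, 18, 12, 0)
freq = timedelta(minutes=5)
temperatures = [45.0, 46.0, 45.0, 50.0, 59.0, 50.0, 48.0, 51.0, 52.0, 56.0, 58.0, 53.0]
samples = [(start + i * freq, t) for i, t in enumerate(temperatures)]
sessions = sessionize(samples, threshold=55.0, freq=freq)
```

On the demonstration data this yields two sessions, 12:17:30 to 12:22:30 and 12:42:30 to 12:52:30, the same intervals as the IntervalIndex shown above.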


License:Mozilla Public License 2.0

