tanduong / mlops-python-package

MLOps Python Package

This repository contains a Python package implementation designed to support MLOps initiatives.

The package uses several tools and tips to make your MLOps experience as flexible, robust, productive as possible.

You can use this package as part of your MLOps toolkit or platform (e.g., Model Registry, Experiment Tracking, Realtime Inference, ...).

Install

This section details the requirements, actions, and next steps to kickstart your project.

Prerequisites

Python>=3.11 (to benefit from the latest features and performance improvements)
Poetry>=1.5.1 (to initialize the project virtual environment and its dependencies)

Installation

Clone this GitHub repository on your computer

# with ssh (recommended)
$ git clone git@github.com:fmind/mlops-python-package
# with https
$ git clone https://github.com/fmind/mlops-python-package

Run the project installation with poetry

$ cd mlops-python-package/
$ poetry install

Adapt the code base to your desire

Next Steps

Going from there, there are dozens of ways to integrate this package to your MLOps platform.

For instance, you can use Databricks or AWS as your compute platform and model registry.

It's up to you to adapt the package code to the solution you target. Good luck champ!

Usage

This section explains how configure the project code and execute it on your system.

Configuration

You can add or edit config files in the confs/ folder to change the program behavior.

# confs/training.yaml
job:
  KIND: TrainingJob
  inputs:
    KIND: ParquetDataset
    path: data/inputs.parquet
  target:
    KIND: ParquetDataset
    path: data/target.parquet
  output_model: outputs/model.joblib

This config file instructs the program to start a TrainingJob with 3 parameters:

inputs: dataset that contains the model inputs
target: dataset that contains the model target
output_model: output path to the model artifact

You can find all the parameters of your program in the src/[package]/jobs.py.

Execution

The project code can be executed with poetry during your development:

$ poetry run [package] confs/tuning.yaml
$ poetry run [package] confs/training.yaml
$ poetry run [package] confs/transition.yaml
$ poetry run [package] confs/inference.yaml

In production, you can build, ship, and run the project as a Python package:

poetry build
poetry publish # optional
python -m pip install [package]
[package] confs/transition.yaml

You can also install and use this package as a library for another AI/ML project:

from [package] import jobs

job = jobs.TrainingJob(...)
with job as runner:
    runner.run()

Automation

This project includes several automation tasks to easily repeat common actions.

You can invoke the actions from the command-line or VS Code extension.

# execute the project DAG
$ inv dag
# create a code archive
$ inv package
# list other actions
$ inv --list

Available tasks:

bump.release (bump): Bump a release: major, minor, patch.
bump.version: Bump to the new version.
check.all (check): Run all check tasks.
check.code: Check the codes with pylint.
check.coverage: Check the coverage with coverage.
check.format: Check the formats with isort and black.
check.poetry: Check poetry config files.
check.test: Check the tests with pytest.
check.type: Check the types with mypy.
clean.all (clean): Run all clean tasks.
clean.coverage: Clean coverage files.
clean.dist: Clean the dist folder.
clean.docs: Clean the docs folder.
clean.install: Clean the install.
clean.mypy: Clean the mypy folder.
clean.outputs: Clean the outputs folder.
clean.pytest: Clean the pytest folder.
clean.python: Clean python files and folders.
clean.reset: Reset the project state.
dag.all (dag): Run all DAG tasks.
dag.job: Run the project for the given job name.
docker.all (docker): Run all docker tasks.
docker.build: Build the docker image.
docker.run: Run the docker image.
docs.all (docs): Run all docs tasks.
docs.api: Document the API with pdoc.
docs.serve: Document the API with pdoc.
format.all (format): Run all format tasks.
format.imports: Format code imports with isort.
format.sources: Format code sources with black.
install.all (install): Run all install tasks.
install.poetry: Run poetry install.
install.pre-commit: Run pre-commit install.
package.all (package): Run all package tasks.
package.build: Build a wheel package.

Tools

This sections motivates the use of developer tools to improve your coding experience.

Note: tools with an exclamation mark (!) can be further optimized based on your constraints.

Automation

Pre-defined actions to automate your project development.

Commit: Pre-Commit

Motivations:
- Check your code locally before a commit
- Avoid wasting resources on your CI/CD
- Can perform extra (e.g., file cleanup)
Limitations:
- Add overhead before your commit
Alternatives:
- Git Hooks: less convenient to use

Release: Bump2version

Motivations:
- Easily change the package version
- Can modify multiple files at once
- Suited for SemVer versioning
Limitations:
- https://xkcd.com/1319/
Alternatives:
- Manual edits: less convenient, risk of forgetting a file

Tasks: PyInvoke

Motivations:
- Automate project workflows
- Sane syntax compared to alternatives
- Good trade-off between power/simplicity
Limitations:
- Not familiar to most developers
Alternatives:
- Make: most popular, but awful syntax

CLI

Integrations with the Command-Line Interface (CLI) of your system.

Parser: Argparse!

Motivations:
- Provide CLI arguments
- Included in Python runtime
- Sufficient for providing configs
Limitations:
- More verbose for advanced parsing
Alternatives:
- Typer: code typing for the win!
- Fire: simple but no typing
- Click: more verbose

Logging: Loguru

Motivations:
- Show progress to the user
- Work fine out of the box
- Saner logging syntax
Limitations:
- Doesn't let you deviate from the base usage
Alternatives:
- Logging: available by default, but feel dated

Code

Edition, validation, and versioning of your project source code.

Coverage: Coverage

Motivations:
- Report code covered by tests
- Identify code path to test
- Show maturity to users
Limitations:
- None
Alternatives:
- None

Editor: VS Code

Motivations:
- Open source
- Free, simple, open source
- Great plugins for Python development
Limitations:
- Require some configuration for Python
Alternatives:
- PyCharm: provide a lot, cost a lot
- Vim: I love it, but theres a VS Code plugin
- Spacemacs: I love it even more, but not everybody loves LISP

Formatting: Isort + Black

Motivations:
- Standardize your code format
- Don't waste time arranging your code
- Make your code more readable/maintainable
Limitations:
- Can be disabled in some case (e.g., test layout)
Alternatives:
- YAPF: more config options that you don't need

Quality: Pylint

Motivations:
- Improve your code quality
- Help your write better code
- Great integration with VS Code
Limitations:
- May return false positives (can be disabled locally)
Alternatives:
- Ruff: promising alternative, but no integration with VS Code
- Flake8: too much plugins, I prefer Pylint in practice

Testing: Pytest

Motivations:
- Write tests of pay the price
- Super easy to write new test cases
- Tons of plugins (xdist, sugar, cov, ...)
Limitations:
- Doesn't support parallel execution out of the box
Alternatives:
- Unittest: more verbose, less fun

Typing: Mypy

Motivations:
- Static typing is cool!
- Communicate types to use
- Official type checker for Python
Limitations:
- Can have overhead for complex typing
Alternatives:
- PyRight: check big code base by MicroSoft
- PyType: check big code base by Google
- Pyre: check big code base by Facebook

Versioning: Git

Motivations:
- If you don't version your code, you are a fool!
- Most popular source code manager (what else?)
- Provide hooks to perform automation on some events
Limitations:
- Git can be hard: https://xkcd.com/1597/
Alternatives:
- Mercurial: loved it back then, but git is the only real option

Configs

Manage the configs files of your project to change executions.

Format: YAML

Motivations:
- Change execution without changing code
- Readable syntax, support comments
- Allow to use OmegaConf <3
Limitations:
- Not support out of the box by Python
Alternatives:
- JSON: no comments, more verbose
- TOML: less suited to config merge/sharing

Parser: OmegaConf

Motivations:
- Parse and merge YAML files
- Powerful, doesn't get in your way
- Achieve a lot with few lines of code
Limitations:
- Do not support remote files (e.g., s3, gcs, ...)
Alternatives:
- Hydra: powerful, but gets in your way
- DynaConf: more suited for app development

Reader: Cloudpathlib

Motivations:
- Read files from cloud storage
- Better integration with cloud platforms
- Support several platforms: AWS, GCP, and Azure
Limitations:
- Support of Python typing is not great at the moment
Alternatives:
- Cloud SDK (GCP, AWS, Azure, ...): vendor specific, overkill for this task

Validator: Pydantic

Motivations:
- Validate your config before execution
- Pydantic should be builtin (period)
- Super charge your Python class
Limitations:
- What will happen with Pydantic 2?
Alternatives:
- Dataclass: simpler, but much less powerful
- Attrs: no validation, less intuitive to use

Data

Define the datasets to provide data inputs and outputs.

Container: Pandas

Motivations:
- Load data files in memory
- Lingua franca for Python
- Most popular options
Limitations:
- Only work on one core, lot of gotchas
Alternatives:
- Polars: faster, saner, but less integrations
- Pyspark: powerful, popular, distributed, so much overhead
- Dask, Ray, Modin, Vaex, ...: less integration (even if it looks like pandas)

Format: Parquet

Motivations:
- Store your data on disk
- Column-oriented (good for analysis)
- Much more efficient and saner than text based
Limitations:
- None
Alternatives:
- CSV: human readable, but that's the sole benefit
- Avro: good alternative for row-oriented workflow

Schema: Pandera

Motivations:
- Typing for dataframe
- Communicate data fields
- Support pandas and others
Limitations:
- Adding types to dataframes adds some overhead
Alternatives:
- Great Expectations: powerful, but much more difficult to integrate

Docs

Generate and share the project documentations.

API: pdoc!

Motivations:
- Share docs with others
- Simple tool, only does API docs
- Get the job done, get out of your way
Limitations:
- Only support API docs (i.e., no custom docs)
Alternatives:
- Sphinx: Most complete, overkill for simple projects
- Mkdocs: no support for API doc, which is the core feature

Format: Google

Motivations:
- Common style for docstrings
- Most writeable out of alternatives
- I often write a single line for simplicity
Limitations:
- None
Alternatives:
- Numpy: less writeable
- Sphinx: baroque style

Model

Toolkit to handle machine learning models.

Evaluation: Scikit-Learn Metrics!

Motivations:
- Bring common metrics
- Avoid reinventing the wheel
- Avoid implementation mistakes
Limitations:
- Limited set of metric to be chosen
Alternatives:
- Implement your own: for custom metrics

Format: Joblib!

Motivations:
- Serialize ML models
- Supported by default for scikit-learn
- Suited for large data (e.g., numpy array)
Limitations:
- Doesn't include model metadata
Alternatives:
- MLflow Model: great solution, but requires a server
- Pickle: work out of the box, but less suited for big array
- ONNX: great for deep learning, no guaranteed compatibility for the rest

Interface: Scikit-Learn Base

Motivations:
- Normalize model interface
- Easy to adopt: only define 4 methods
- Most popular format for data scientists
Limitations:
- Doesn't support model saving/loading methods
Alternatives:
- Implement your own: unknown for your users

Storage: Filesystem!

Motivations:
- Store ML models on disk
- Use it for really small project
- Easy to adopt, but doesn't do much
Limitations:
- Should be changed to a better alternative
Alternatives:
- MLflow: great solution, but requires a server
- MLEM: good solution if you use DVC

Package

Define and build modern Python package.

Format: Wheel

Motivations:
- Create source code archive
- Most modern Python format
- Has several advantages
Limitations:
- Doesn't ship with C/C++ dependencies (e.g., CUDA)
  - i.e., use Docker containers for this case
Alternatives:
- Source: older format, less powerful

Manager: Poetry

Motivations:
- Define and build Python package
- Most popular solution by GitHub stars
- Pack every metadata in a single static file
Limitations:
- Cannot add dependencies beyond Python (e.g., CUDA)
  - i.e., use Docker container for this use case
Alternatives:
- Setuptools: dynamic file is slower and more risky
- Pdm, Hatch, PipEnv: https://xkcd.com/1987/

Runtime: Docker

Motivations:
- Create isolated runtime
- Container is the de facto standard
- Package C/C++ dependencies with your project
Limitations:
- Some company might block Docker Desktop, you should use alternatives
Alternatives:
- Conda: slow and heavy resolver

Programming

Select your programming environment.

Language: Python

Motivations:
- Great language for AI/ML projects
- Robust with additional tools
- Hundreds of great libs
Limitations:
- Slow without C bindings
- Need of a Gilectomy
Alternatives:
- R: specific purpose language
- Julia: specific purpose language

Version: Pyenv

Motivations:
- Switch between Python version
- Allow to select the best version
- Support global and local dispatch
Limitations:
- Require some shell configurations
Alternatives:
- Manual installation: time consuming

Tips

This sections gives some tips and tricks to enrich the develop experience.

AI/ML Practices

Data Catalog

You should decouple the pointer to your data from how to access it.

In your code, you can refer to your dataset with a tag (e.g., inputs, target).

This tag can then be associated to an reader/writer implementation in a configuration file:

inputs:
  KIND: ParquetDataset
  path: data/inputs.parquet
target:
  KIND: ParquetDataset
  path: data/target.parquet

In this package, the implementation are described in src/[package]/datasets.py and selected by KIND.

Hyperparameter Optimization

You should select the best hyperparameters for your model using optimization search.

The simplest projects can use a sklearn.model_selection.GridSearchCV to scan the whole search space.

This package provides a simple interface to this hyperparameter search facility in src/[packager]/searchers.py.

For more complex project, we recommend to use more complex strategy (e.g., Bayesian) and software package (e.g., Optuna).

Data Splits

You should properly split your dataset into a training, validation, and testing sets.

Training: used for fitting the model parameters
Validation: used to find the best hyperparameters
Testing: used to evaluate the final model performance

The sets should be exclusive, and the testing set should never be used as training inputs.

This package provides a simple deterministic strategy implemented in src/[package]/splitters.py.

Design Patterns

Directed-Acyclic Graph

You should use Directed-Acyclic Graph (DAG) to connect the steps of your ML pipeline.

A DAG can express the dependencies between steps while keeping the individual step independent.

This package provides a simple DAG example in tasks/dag.py. This approach is based on PyInvoke.

In production, we recommend to use a scalable system such as Airflow, Dagster, Prefect, Metaflow, or ZenML.

Program Service

You should provide a global context for the execution of your program.

There are several approaches such as Singleton, Global Variable, or Component.

This package takes inspiration from Clojure mount. It provides an implementation in src/[package]/services.py.

Soft Coding

You should separate the program implementation from the program configuration.

Exposing configurations to users allow them to influence the execution behavior without code changes.

This package seeks to expose as much parameter as possible to the users in configurations stored in the confs/ folder.

SOLID Principles

You should implement the SOLID principles to make your code as flexible as possible.

Single responsibility principle: Class has one job to do. Each change in requirements can be done by changing just one class.
Open/closed principle: Class is happy (open) to be used by others. Class is not happy (closed) to be changed by others.
Liskov substitution principle: Class can be replaced by any of its children. Children classes inherit parent's behaviours.
Interface segregation principle: When classes promise each other something, they should separate these promises (interfaces) into many small promises, so it's easier to understand.
Dependency inversion principle: When classes talk to each other in a very specific way, they both depend on each other to never change. Instead classes should use promises (interfaces, parents), so classes can change as long as they keep the promise.

In practice, this mean you can implement software contracts with interface and swap the implementation.

For instance, you can implement several jobs in src/[package]/jobs.py and swap them in your configuration.

To learn more about the mechanism select for this package, you can check the documentation for Pydantic Tagged Unions.

Python Powers

Context Manager

You should use Python context manager to control and enhance an execution.

Python provides contexts that can be used to extend a code block. For instance:

# in src/[package]/scripts.py
with job as runner:  # context
    runner.run()  # run in context

This pattern has the same benefit as Monad, a powerful programming pattern.

The package uses src/[package]/jobs.py to handle exception and services.

Python Package

You should create Python package to create both library and application for others.

Using Python package for your AI/ML project has the following benefits:

Build code archive (i.e., wheel) that be uploaded to Pypi.org
Install Python package as a library (e.g., like pandas)
Expose script entry points to run a CLI or a GUI

To build a Python package with Poetry, you simply have to type in a terminal:

# for all poetry project
poetry build
# for this project only
inv package

Software Engineering

Code Typing

You should type your Python code to make it more robust and explicit for your user.

Python provides the typing module for adding type hints and mypy to checking them.

# in src/[package]/models.py
@abc.abstractmethod
def fit(self, inputs: schemas.Inputs, target: schemas.Target) -> "Model":
    """Fit the model on the given inputs and target."""

@abc.abstractmethod
def predict(self, inputs: schemas.Inputs) -> schemas.Output:
    """Generate an output with the model for the given inputs."""

This code snippet clearly state the inputs and outputs of the method, both for the developer and the type checker.

The package aims to type every functions and classes to facilitate the developer experience and fix mistakes before execution.

Config Typing

You should type your configuration to avoid exceptions during the program execution.

Pydantic allows to define classes that can validate your configs during the program startup.

# in src/[package]/splitters.py
class TrainTestSplitter(Splitter):
    ratio: float = 0.8
    shuffle: bool = True
    random_state: int = 42

This code snippet allows to communicate the values expected and avoid error that could be avoided.

The package combines both OmegaConf and Pydantic to parse YAML files and validate them as soon as possible.

Dataframe Typing

You should type your dataframe to communicate and validate their fields.

Pandera supports dataframe typing for Pandas and other library like PySpark:

# in src/package/schemas.py
class InputsSchema(Schema):
    alcohol: papd.Series[float] = pa.Field(gt=0, lt=100)
    malic_acid: papd.Series[float] = pa.Field(gt=0, lt=10)
    ash: papd.Series[float] = pa.Field(gt=0, lt=10)
    alcalinity_of_ash: papd.Series[float] = pa.Field(gt=0, lt=100)
    magnesium: papd.Series[float] = pa.Field(gt=0, lt=1000)
    total_phenols: papd.Series[float] = pa.Field(gt=0, lt=10)
    flavanoids: papd.Series[float] = pa.Field(gt=0, lt=10)
    nonflavanoid_phenols: papd.Series[float] = pa.Field(gt=0, lt=10)
    proanthocyanins: papd.Series[float] = pa.Field(gt=0, lt=10)
    color_intensity: papd.Series[float] = pa.Field(gt=0, lt=100)
    hue: papd.Series[float] = pa.Field(gt=0, lt=10)
    od280_od315_of_diluted_wines: papd.Series[float] = pa.Field(gt=0, lt=10)
    proline: papd.Series[float] = pa.Field(gt=0, lt=10000)

This code snippet defines the fields of the dataframe and some of its constraint.

The package encourages to type every dataframe used in src/[package]/schemas.py.

Object Oriented

You should use the Objected Oriented programming to benefit from polymorphism.

Polymorphism combined with SOLID Principles allows to easily swap your code components.

class Dataset(abc.ABC, pdt.BaseModel):

    @abc.abstractmethod
    def read(self) -> pd.DataFrame:
        """Read a dataframe from a dataset."""

    @abc.abstractmethod
    def write(self, data: pd.DataFrame) -> None:
        """Write a dataframe to a dataset."""

This code snippet uses the abc module to define code interfaces for a dataset with a read/write method.

The package defines class interface whenever possible to provide intuitive and replaceable parts for your AI/ML project.

Semantic Versioning

You should use semantic versioning to communicate the level of compatibility of your releases.

Semantic Versioning (SemVer) provides a simple schema to communicate code changes. For package X.Y.Z:

Major (X): major release with breaking changed (i.e., imply actions from the benefit)
Minor (Y): minor release with new features (i.e., provide new capabilities)
Patch (Z): patch release to fix bugs (i.e., correct wrong behavior)

Poetry and this package leverage Semantic Versioning to let developers control the speed of adoption for new releases.

Testing Tricks

Parallel Testing

You can run your tests in parallel to speed up the validation of your code base.

Pytest can be extended with the pytest-xdist plugin for this purpose.

This package enables Pytest in its automation tasks by default.

Test Fixtures

You should define reusable objects and actions for your tests with fixtures.

Fixture can prepare objects for your test cases, such as dataframes, models, files.

This package defines fixtures in tests/conftest.py to improve your testing experience.

VS Code

Code Workspace

You can use VS Code workspace to define configurations for your project.

Code Workspace can enable features (e.g. formatting) and set the default interpreter.

{
	"settings": {
		"editor.formatOnSave": true,
		"python.defaultInterpreterPath": ".venv/bin/python",
    ...
	},
}

This package defines a workspace file that you can load from [package].code-workspace.

GitHub Copilot

You can use GitHub Copilot to increase your coding productivity by 30%.

GitHub Copilot has been a huge productivity thanks to its smart completion.

You should become familiar with the solution in less than a single coding session.

VSCode VIM

You can use VIM keybindings to more efficiently navigate and modify your code.

Learning VIM is one of the best investment for a career in IT. It can make you 30% more productive.

Compared to GitHub Copilot, VIM can take much more time to master. You can expect a ROI in less than a month.

Resources

This section provides resources for building packages for Python and AI/ML/MLOps.

MLOps Python Package

Table of Contents

Install

Prerequisites

Installation

Next Steps

Usage

Configuration

Execution

Automation

Tools

Automation

Commit: Pre-Commit

Release: Bump2version

Tasks: PyInvoke

CLI

Parser: Argparse!

Logging: Loguru

Code

Coverage: Coverage

Editor: VS Code

Formatting: Isort + Black

Quality: Pylint

Testing: Pytest

Typing: Mypy

Versioning: Git

Configs

Format: YAML

Parser: OmegaConf

Reader: Cloudpathlib

Validator: Pydantic

Data

Container: Pandas

Format: Parquet

Schema: Pandera

Docs

API: pdoc!

Format: Google

Model

Evaluation: Scikit-Learn Metrics!

Format: Joblib!

Interface: Scikit-Learn Base

Storage: Filesystem!

Package

Format: Wheel

Manager: Poetry

Runtime: Docker

Programming

Language: Python

Version: Pyenv

Tips

Resources

Python

AI/ML/MLOps

About

Languages