computer-vision machine-learning software-engineering

The Anatomy of a Machine Learning Pipeline

Overview

This repository contains the code for the webinar The Anatomy of a Machine Learning Pipeline, presented at both ENBIS and the University of Milan-Bicocca. The webinar is based on the book The Pragmatic Programmer for Machine Learning, published by Taylor & Francis.

Editions

Slides

The Anatomy of a Machine Learning Pipeline

Prerequisites

Before using this repository, install the following dependencies with your Linux distribution's package manager, Homebrew on macOS, or Chocolatey on Windows. They are required for the code to run correctly and for results to be reproducible.

Docker 26+ - Docker documentation.
Docker Compose 2.27+ - Docker Compose documentation.
Python 3.10+ - Python documentation.
Poetry 1.8+ - Poetry documentation.
GNU Make 3.81+ - GNU Make documentation.

To verify that the correct versions are installed, run the following command (macOS and GNU/Linux only):

make check-deps
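
Where make is not available (for example on Windows), a comparable check can be sketched in Python. This is a minimal sketch and not the repository's check-deps target: the minimum versions are copied from the list above, while the TOOLS table, the helper names, and the exact version commands are assumptions made for illustration.

```python
import re
import shutil
import subprocess

# Minimum versions copied from the prerequisites list above.
# The commands used to query each tool are assumptions.
TOOLS = {
    "Docker": (["docker", "--version"], (26, 0)),
    "Docker Compose": (["docker", "compose", "version"], (2, 27)),
    "Python": (["python3", "--version"], (3, 10)),
    "Poetry": (["poetry", "--version"], (1, 8)),
    "GNU Make": (["make", "--version"], (3, 81)),
}


def parse_version(text):
    """Return the first MAJOR.MINOR pair found in text, or None."""
    match = re.search(r"(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def meets_minimum(found, minimum):
    """Compare (major, minor) tuples; a missing version never passes."""
    return found is not None and found >= minimum


def check_tool(command, minimum):
    """Run the version command and report whether it satisfies the minimum."""
    if shutil.which(command[0]) is None:
        return False
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return meets_minimum(parse_version(result.stdout + result.stderr), minimum)


if __name__ == "__main__":
    for name, (command, minimum) in TOOLS.items():
        status = "ok" if check_tool(command, minimum) else "missing or too old"
        print(f"{name}: {status}")
```

Only the major and minor components are compared, which is enough for the minimums listed here.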

First installation

We suggest using asdf to manage the Python version; you can install it by following the asdf installation instructions.

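# Install Python 3.10.15 through asdf
asdf plugin-add python
asdf install python 3.10.15
# Set up direnv for bash
asdf plugin-add direnv
asdf direnv setup --shell bash --version latest
python --version
# Install Poetry and keep the virtual environment inside the project
pip install poetry
poetry config virtualenvs.in-project true
poetry config virtualenvs.path .venv
# Fetch the code and install its dependencies
git clone https://github.com/pragprogml/enbis-2024
cd enbis-2024
poetry install --no-root
poetry env info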

Environment variables

cp .envrc.example .envrc
vim .envrc
cat .envrc

layout python
ROOT_DIR="/home/user/development/enbis-2024"
PARAMS="params-dev.yaml"
VIRTUAL_ENV=.venv
PATH=$VIRTUAL_ENV/bin:$PATH
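
Once direnv has loaded .envrc, Python code can read these values from the environment. The sketch below is an assumption about how ROOT_DIR and PARAMS might be consumed, not the repository's actual loader: the resolve_params_path helper is hypothetical, and it assumes both variables are visible in the process environment.

```python
import os
from pathlib import Path


def resolve_params_path(env=None):
    """Build the parameters file path from ROOT_DIR and PARAMS.

    Falls back to the current directory and params-dev.yaml (the value
    shown in the example .envrc) when a variable is not set.
    """
    env = os.environ if env is None else env
    root = Path(env.get("ROOT_DIR", "."))
    return root / env.get("PARAMS", "params-dev.yaml")


if __name__ == "__main__":
    print(resolve_params_path())
```

Passing a plain dictionary instead of os.environ makes the helper easy to exercise without touching the real environment.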

Model training and evaluation

dvc stage list

train     Outputs models/best_model.pt
evaluate  Outputs reports/yolo_metrics.png, reports/predicted_images.png

dvc dag

  +-------+
  | train |
  +-------+
      *
      *
      *
+----------+
| evaluate |
+----------+

dvc repro train
dvc repro evaluate
dvc repro --downstream evaluate # start at evaluate without re-running the upstream train stage

Artifacts

The outputs and artifacts of each training run are stored under runs/, while the evaluation outputs (reports/yolo_metrics.png and reports/predicted_images.png) are written to the reports/ directory.
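
A quick way to confirm that a pipeline run produced what the stages declare is to check for the files reported by dvc stage list. This is a minimal sketch, assuming those paths are relative to the repository root; the missing_outputs helper is illustrative and not part of the repository.

```python
from pathlib import Path

# Outputs declared by the train and evaluate stages (see `dvc stage list`).
EXPECTED_OUTPUTS = [
    "models/best_model.pt",
    "reports/yolo_metrics.png",
    "reports/predicted_images.png",
]


def missing_outputs(root="."):
    """Return the expected output paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_OUTPUTS if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing outputs:", ", ".join(missing))
    else:
        print("All expected outputs are present.")
```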

Model training and evaluation using Jupyter Notebook

Experiment Tracking

Update the YOLO settings in training.py:training_config to enable W&B or MLflow, or use the yolo command line:

yolo settings mlflow={True|False} wandb={True|False}
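
The same switches can also be set from Python. This is a sketch that assumes the installed ultralytics package exposes a settings object with an update method and the mlflow and wandb keys, as recent releases do; tracking_settings and apply_tracking are illustrative names, not part of the repository.

```python
def tracking_settings(mlflow=False, wandb=False):
    """Build the settings mapping for the two experiment trackers."""
    return {"mlflow": bool(mlflow), "wandb": bool(wandb)}


def apply_tracking(mlflow=False, wandb=False):
    """Persist the tracker switches in the Ultralytics settings file."""
    # Imported lazily so tracking_settings works without ultralytics installed.
    from ultralytics import settings

    settings.update(tracking_settings(mlflow=mlflow, wandb=wandb))


# Example (requires ultralytics): enable MLflow logging and leave W&B off.
# apply_tracking(mlflow=True, wandb=False)
```

Because the settings are persisted, the change applies to later training runs until it is switched back.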

Web Evaluation Dashboard

make run-demo

Inference API

make run-api

make docker-build
make docker-run

Lint and format

make lint

Dataset

The dataset we presented during the seminar is proprietary and cannot be shared or used outside our organization. As a substitute, consider a single-class dataset such as the Synthetic Corrosion Dataset Computer Vision Project. If you have questions or need help finding a suitable dataset, please let us know.

Contact

Reach out to us via https://github.com/pragprogml.

Authors

Marco Scutari, Ph.D. - Senior Researcher in Bayesian Networks and Graphical Models - Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA)
Mauro Malvestio - Founder & CTO @ DSCOVR

License

MIT License

Citing

When citing PPML in academic papers and theses, please use this BibTeX entry:

@BOOK{ppml,
  author        = {M. Scutari and M. Malvestio},
  title         = {{The Pragmatic Programmer for Machine Learning: Engineering
                    Analytics and Data Science Solutions}},
  publisher     = {Chapman \& Hall},
  year          = {2023}
}