Awesome Machine Learning Engineer

For more awesomeness, check out .

What is this and how do I use it?

This is a curated list of delightful resources for everything you need to develop Machine Learning solutions.
Each item in this list will teach you at least one distinct and significant skill or piece of information.
There are three content levels:
1. 🐥 Essential reading for all ML engineers
2. 🐍 Advanced reading for professional ML engineers
3. 🦄 Expert material for expert ML engineers
Descriptions are written to complete the sentence "After reading this article you will have learned ...".

Communication
Software Engineering
Machine Learning
DevOps

Communication

🐥 BLUF: The Military Standard That Can Make Your Writing More Powerful - How to make your communication more powerful (5 min)
🐥 The XY Problem - How to focus on explaining your end goal when asking for help (5 min)
🐥 Bike-shedding: how mature are you as an engineer? - How to avoid and call out bike-shedding (5 min)
🐥 E-mail like a boss - How to write better e-mails (5 min)
🐥 Stop Swiss Cheesing your calendar - How to manage your calendar so you can focus (15 min)
🐥 How to write in plain English - How to write in plain English (30 min)
🐥 Presentation Rules - How to create a great slide deck (30 min)
🐍 SMART criteria - How to define goals (15 min)
🐍 MECE principle - How to fully decompose a problem into a structured list (15 min)
🐍 SCQA: What is it, how does it work, and how can it help me? - How to structure your presentations, proposals, and sales outlines (15 min)
🐍 No More Misunderstandings - How to avoid miscommunication by paraphrasing (15 min)
🦄 Nonviolent communication - How to deliver constructive feedback in difficult situations (15 min)
🦄 The Halo effect - How to recognize and use the Halo effect to your advantage (15 min)
🦄 Mythical Man Month - The relationship between person-days and throughput time in a project (15 min)
🦄 Four-sides model - How to communicate effectively by considering how the receiver interprets your message (30 min)

Software Engineering

API design

🐥 Semantic Versioning - How to bump the version of your apps and packages (15 min)
🐥 __all__ and wild imports in Python - How __all__ defines the public API of your Python packages (15 min)
🐥 APIs for Machine Learning - How to design RESTful APIs for Machine Learning applications (30 min)
🐍 FastAPI docs - How to build RESTful APIs that correspond one-to-one with an OpenAPI specification (1 day)
🐍 The Rule of Three - When to build reusable components and when not (15 min)
🐍 Falsehoods programmers believe about time - How to avoid common pitfalls about time (15 min)
🐍 Falsehoods programmers believe about names - How to avoid common pitfalls about names (15 min)
🐍 Command Line Interface Guidelines - How to write great CLIs (1 hour)
🦄 Zalando's RESTful API guidelines - How to design RESTful APIs (1 day)
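
As a rough illustration of the Semantic Versioning entry in this list, the sketch below bumps a plain `MAJOR.MINOR.PATCH` version string. The helper name `bump` and its `part` argument are hypothetical, and pre-release and build metadata are deliberately ignored.

```python
def bump(version: str, part: str) -> str:
    """Return `version` with the given part incremented (hypothetical helper)."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":  # incompatible API change: reset minor and patch
        return f"{major + 1}.0.0"
    if part == "minor":  # backwards-compatible feature: reset patch
        return f"{major}.{minor + 1}.0"
    if part == "patch":  # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

For example, `bump("1.4.2", "minor")` returns `"1.5.0"`.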

Workflow

🐥 Poetry Cookiecutter - How to scaffold a modern Poetry-based development environment for Python packages and apps (30 min)
🐥 The seven rules of a great Git commit message - How to write great Git commit messages (15 min)
🐥 Learn Git Branching - Practice Git from beginner to advanced (1 hour)
🐍 Keep a Changelog - How to keep a changelog for your apps and packages (30 min)
🐍 Conventional Commits - How to prefix your commit messages to automate Semantic Versioning and Keep a Changelog (15 min)
🐍 Testing Python Applications with Pytest - How to properly test a package with pytest (30 min)
🐍 A successful Git branching model - How to release software with Git (15 min)
🐍 Code Review Best Practices - What to look for when reviewing a Pull Request (30 min)
🐍 Code Health: Respectful Reviews == Useful Reviews - How to communicate code review comments respectfully (15 min)
🐍 The Code Review Pyramid - What to look for and what to automate when reviewing a Pull Request (15 min)
🦄 Poetry workspace plugin - How to create and manage a Poetry-based monorepo (15 min)
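
To make the link between Conventional Commits and Semantic Versioning concrete, here is a minimal sketch that derives the required version bump from a commit message. The `HEADER` regex and the `required_bump` function are assumptions made for illustration and cover only the core of the specification; a tool such as commitizen handles far more cases.

```python
import re
from typing import Optional

# Header shape: type, optional (scope), optional "!", then ": description".
HEADER = re.compile(
    r"^(?P<type>\w+)(?:\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<description>.+)$"
)


def required_bump(message: str) -> Optional[str]:
    """Return "major", "minor", "patch", or None for a commit message."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        raise ValueError("not a Conventional Commits header")
    # A "!" in the header or a BREAKING CHANGE footer signals a major bump.
    has_breaking_footer = any(
        line.startswith(("BREAKING CHANGE:", "BREAKING-CHANGE:"))
        for line in lines[1:]
    )
    if match["breaking"] or has_breaking_footer:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return None  # e.g. docs, chore, refactor: no release needed
```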

Python patterns

🐥 PEP20 "The Zen of Python" - How to write idiomatic Python (15 min)
🐥 The Definitive Guide to Python import Statements - How to write import statements (30 min)
🐥 Understanding Python's logging module - How to use the logging module effectively (30 min)
🐍 Don't run code at import time - Why you shouldn't run code at import time
🐍 Please fix your decorators - Why you should probably use wrapt to write your decorators (30 min)
🦄 Do not log - What you should be doing instead of logging (30 min)
🦄 The Little Book of Python Anti-Patterns - A collection of Python anti-patterns (X hours)
🦄 Effective Python - A collection of Python idioms (X hours)
🦄 Python Design Patterns - A collection of software architecture patterns (1 hour)
🦄 SOLID - A standard set of software architecture patterns (1 hour)
🦄 What the f*ck Python! - How to master Python by understanding its edge cases (1 day)
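
One common way to avoid running code at import time is to defer expensive work to the first call and to configure logging only in the entry point. The sketch below assumes a hypothetical `get_settings` function and a placeholder `model_path` value; it is not taken from the linked articles.

```python
import functools
import logging

# Creating a logger is cheap and has no side effects at import time.
logger = logging.getLogger(__name__)


@functools.lru_cache(maxsize=None)
def get_settings() -> dict:
    """Build the settings on first call instead of at import (hypothetical)."""
    logger.info("Loading settings")
    return {"model_path": "model.pkl"}  # placeholder value


def main() -> None:
    # Logging is configured by the application entry point, not by the module.
    logging.basicConfig(level=logging.INFO)
    settings = get_settings()
    logger.info("Using model at %s", settings["model_path"])


if __name__ == "__main__":
    main()
```

Importing this module does nothing beyond defining names, so tests and other tools can import it without triggering the settings load.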

Typing

🐍 The Comprehensive Guide to mypy - How to write type annotations in Python (1 hour)
🐍 Pydantic overview - How to write type annotations for complex types instead of a meaningless Dict[str, Any] (1 hour)
🐍 Magic number - Why magic values are an anti-pattern (15 min)
🐍 Enums - How to write Enums in Python instead of type-unsafe magic values (15 min)
🦄 Mypy generics - How to use TypeVars to write generic types such as List[T] (30 min)
🦄 Mypy protocols - How to use Protocols to define interfaces such as Iterable (30 min)
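
The three ideas in this section (Enums instead of magic values, generics with TypeVars, and Protocols as interfaces) can be sketched together as follows. All names here (`Split`, `HasLength`, `take`, `is_empty`) are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import List, Protocol, Sequence, TypeVar


class Split(Enum):
    """Replaces the magic strings "train" and "test"."""

    TRAIN = "train"
    TEST = "test"


class HasLength(Protocol):
    """Structural interface: anything that implements __len__ qualifies."""

    def __len__(self) -> int: ...


T = TypeVar("T")


def take(items: Sequence[T], n: int) -> List[T]:
    """Generic function: the element type of the result follows the input."""
    return list(items[:n])


def is_empty(obj: HasLength) -> bool:
    """Accepts lists, strings, dicts, ... without requiring a base class."""
    return len(obj) == 0
```

A type checker such as mypy would then flag `take([1, 2, 3], 2)[0].upper()`, because it infers that the result is a `List[int]`.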

Curated Python packages

Workflow

🐍 cookiecutter - Scaffold new Python packages or apps quickly with a Cookiecutter template
🐍 cruft - Update a Python package's underlying Cookiecutter scaffolding
🐍 commitizen - Check that commit messages satisfy Conventional Commits and automate Semantic Versioning and Keep a Changelog
🐍 poetry - Manage the packaging and dependencies of your Python project
🐍 poe - Define and run tasks in a Poetry project with Poe the Poet
🐍 poetry-workspace-plugin - Manage a Python monorepo with this Poetry plugin

Code quality

🐥 black - Automatically format your code
🐥 isort - Automatically sort your import statements
🐍 pre-commit - Automatically run code quality checks on commit
🐍 bandit - Find common security issues
🐍 darglint - Check that your docstrings match your function signature
🐍 flake8 - Check your code for bugs and that your code style is PEP8-compliant
🐍 flake8 extensions - An awesome list of Flake8 extensions
🐍 mypy - Check the type-correctness of your code
🐍 pre-commit hooks - A collection of pre-commit hooks that check file quality
🐍 pydocstyle - Check that your code is documented
🐍 pygrep hooks - A collection of pre-commit hooks that check for common Python code smells
🐍 pytest-recording - Record and play back HTTP requests in your pytest tests
🐍 pyupgrade - Check that your code is written using the latest Python language features
🐍 safety - Check that your dependencies don't have any known security vulnerabilities
🐍 shellcheck - Check the quality of your shell scripts
🐍 coverage.py - Check your code's test coverage
🦄 hypothesis - Write tests that automatically look for edge cases that break your code
🦄 hypothesis-auto - Automatically generate Hypothesis tests based on your code's type annotations

Application development

🐍 fastapi - Create RESTful APIs based on type annotations
🐍 typer - Create CLIs based on type annotations
🐍 streamlit - Create web apps with a single Python file

Utilities

🐍 bump2version - Release a new version of your package
🐍 coloredlogs - Increase your logs' readability with colour
🐍 hvplot - Create interactive plots from pandas dataframes
🐍 mkdocs - Create developer documentation for your project
🐍 pdoc - Generate API documentation for your code
🐍 birdseye - Graphically debug your Python code
🐍 scalene - Profile your code's CPU and memory usage by line
🐍 viztracer - Visualize your code's performance with a flamegraph
🐍 tqdm - Easily add progress bars to long-running jobs

Machine Learning

Practical theory

🐍 Bias-variance tradeoff - How a model's expected squared error decomposes into squared bias, variance, and irreducible noise (30 min)
🐍 The two different uses of cross-validation - How to use nested cross-validation to combine the two different uses of cross-validation (30 min)
🐍 Modes, Medians and Means: A Unifying Perspective - Why minimizing the Mean Absolute Error (MAE) is more robust than minimizing the Mean Squared Error (MSE) (30 min)
🐍 Backpropagation is the chain rule to compute the gradient - How backpropagation is an algorithm to compute the objective function's gradient (30 min)
🐍 Stacked generalization - How to stack models (30 min)
🐍 We have been using the wrong initialization for t-SNE and UMAP - How to initialize t-SNE and UMAP properly (15 min)
🦄 From classic Fully Connected Networks to Transformers - How neural networks evolved from Fully Connected Networks to Transformers (30 min)
🦄 What is the .632+ rule? - How to measure generalization performance with bootstrapping (30 min)
🦄 Stacking strategies with and without leaks - Different strategies to stack models (30 min)
🦄 Data Distribution Shifts and Monitoring - How to detect and address the different types of data shift (1 hour)
🦄 Backprop is not just the chain rule - How backpropagation relates to Lagrange multipliers (30 min)
🦄 Why ML algorithms are hard to tune - How to optimize multiple objectives when the Pareto front is concave (30 min)
🦄 Deep learning model compression - How quantization, pruning, and distillation can be used to compress models (30 min)
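
The claim that minimizing the MAE is more robust than minimizing the MSE can be checked numerically: the constant that minimizes the MSE is the mean, while the constant that minimizes the MAE is the median, which barely moves when an outlier is added. The sketch below uses a made-up data set and a simple grid search, both of which are assumptions for illustration.

```python
from statistics import mean, median

# Made-up sample with one large outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]


def mse(c: float, xs: list) -> float:
    """Mean squared error of predicting the constant c."""
    return sum((x - c) ** 2 for x in xs) / len(xs)


def mae(c: float, xs: list) -> float:
    """Mean absolute error of predicting the constant c."""
    return sum(abs(x - c) for x in xs) / len(xs)


# Search constant predictions on a grid from 0.0 to 100.0 in steps of 0.1.
candidates = [i / 10 for i in range(0, 1001)]
best_mse = min(candidates, key=lambda c: mse(c, data))
best_mae = min(candidates, key=lambda c: mae(c, data))

print(best_mse, mean(data))    # the MSE minimizer is pulled towards the outlier
print(best_mae, median(data))  # the MAE minimizer stays with the bulk of the data
```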

Explainability

🐍 SHAP: SHapley Additive exPlanations - How to explain a model's output with Shapley values (30 min)
🦄 Intro to Shapley and SHAP - How Shapley values are approximated by SHAP (30 min)
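
For intuition about what SHAP approximates, the sketch below computes exact Shapley values by brute force for a tiny toy game. The function `shapley_values` and the value function `toy_value` are hypothetical; this is not SHAP's algorithm, and enumerating all orderings is only feasible for a handful of players.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition: FrozenSet[str]) -> float:
    """Toy payoff: a adds 10, b adds 20, a and b together add 6, c adds 0."""
    payoff = 0.0
    if "a" in coalition:
        payoff += 10.0
    if "b" in coalition:
        payoff += 20.0
    if {"a", "b"} <= coalition:
        payoff += 6.0
    return payoff


phi = shapley_values(["a", "b", "c"], toy_value)
```

Here the interaction bonus of 6 is split evenly between `a` and `b`, and the values sum to the payoff of the full coalition.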

Unsupervised

🐍 UMAP: Uniform Manifold Approximation and Projection - How to reduce dimensionality for visualization and modelling (30 min)
🐍 PyNNDescent - How to find nearest neighbours in huge datasets (15 min)

Classification

🐥 Precision and recall - How precision and recall measure a classifier's performance (30 min)
🐍 Probability calibration - How and for which model types you should calibrate the model's output scores into probabilities (30 min)
🐍 You're all calculating churn rates wrong - Correctly define what churn is (30 min)
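
As a quick reference for the precision and recall entry, the sketch below computes both metrics for binary labels from counts of true positives, false positives, and false negatives. The function name and the convention of returning 0.0 when a denominator is zero are assumptions for illustration.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged by mistake
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, with three true positives, one false positive, and one false negative, both precision and recall are 0.75.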

Regression

🦄 Gaussian processes - From scratch - How to build probabilistic regression models with Gaussian Processes (1 hour)

Computer Vision

🐍 Microsoft's Document Image Transformer - A self-supervised pre-trained model that achieves SotA performance on PubLayNet and can be used for various downstream tasks (30 min)

Natural Language Processing

🐍 Awesome Sentence Embedding - A curated list of pretrained sentence and word embedding models (15 min)

Time Series Analysis

🐍 The Prophet model - How Meta's Prophet model decomposes a time series into a trend, seasonality, and holiday components (30 min)
🐍 Darts - Time Series Made Easy in Python - How to build forecasting models with darts (1 hour)

Recommender Systems

🐍 Microsoft Recommenders - A comparison of recommender system models (30 min)

Tensor computation libraries

🐍 What I Wish Someone Had Told Me About Tensor Computation Libraries - How JAX, PyTorch, TensorFlow, and Theano are different (30 min)

Pandas

🐍 Modern Pandas series (Part 1 - 7) - Write idiomatic pandas (1 hour)
🐍 Awesome Pandas - An awesome list of Pandas resources (1 hour)

scikit-learn

🐥 Using scikit-learn Pipelines and FeatureUnions - How to use Pipelines to build end-to-end models (30 min)
🐥 Transforming target in regression - How to transform the target to build more robust models (15 min)
🐍 ColumnTransformer for heterogeneous data - How to use ColumnTransformer to process pandas DataFrames in sklearn Pipelines (30 min)
🐍 Custom Estimators - How to create your own custom Estimator (30 min)
🐍 Hyperparameter optimization with successive halving - How to optimize hyperparameters with the most computationally efficient method (30 min)

Labelling

🐍 Doccano - A tool for labelling text (30 min)
🐍 CVAT: Computer Vision Annotation Tool - A tool for labelling images (30 min)
🐍 Awesome Data Labelling - An awesome list of data labelling tools (30 min)

DevOps

CI/CD

🐍 invoke - How to implement common tasks you run on your project as a CLI (30 min)
🐍 poe - How to implement common tasks you run on your project as a CLI (30 min)

Environment and dependency management

🐥 Intro to packaging and dependency management for Python with Poetry - How to manage your Python package's dependencies and environment (30 min)
🐍 Intro to Pyenv for Machine Learning - How to use pyenv to manage your Python interpreter (30 min)
🐍 Modern Python Environments - dependency and workspace management - A comparison between pyenv, venv + pip, venv + pip-tools, poetry, pipenv, and conda (30 min)
🦄 Conda: Myths and Misconceptions - Common misconceptions about Conda (15 min)

Docker

🐥 Docker Curriculum - How to use Docker (4 hours)
🐍 Docker layer caching - How to write Dockerfiles to benefit from layer caching (30 min)
🐍 Dockerfile best practices - How to write good Dockerfiles (1 hour)
🐍 Configuring Gunicorn for Docker - How to best configure Gunicorn for a Docker image (30 min)
🐍 Speed up Docker with BuildKit’s new caching - How to speed up Docker builds with a build cache (30 min)
🦄 Build secrets in Docker and Compose, the secure way - How to use secrets in a Docker build (15 min)
🦄 Security scanners for Python and Docker - How to scan your code and your Docker image for security issues (30 min)
🦄 The security scanner that cried wolf - How to scan your Docker image for security issues without false positives (15 min)
🦄 Awesome Docker - An awesome list of Docker resources (30 min)

Data pipelines

🐍 Great Expectations - How to test and document your data and data pipelines (30 min)

Shell

🐍 Cron best practices - How to best use cron to schedule tasks (30 min)
🐍 A visual guide to SSH tunnels - How to forward ports and create tunnels with SSH (30 min)
🐍 Safe ways to do things in bash - How to write safe and robust shell scripts (1 hour)
🦄 Your terminal is not a terminal: An Introduction to Streams - How your terminal is a tool to manipulate streams (30 min)
🦄 Bash Heredoc - How to pass multiline arguments to commands with a heredoc (30 min)
🦄 Please stop writing shell scripts - Why you shouldn't write shell scripts for CI/CD or Docker images (30 min)

Terraform

🐥 An Introduction to Terraform - How to use Terraform (1 hour)
🐍 Terraform best practices - How to structure and maintain Terraform code (1 hour)
🦄 Terraform pre-commit hooks collection - How to automate Terraform code quality checks with pre-commit (1 hour)
🦄 Awesome Terraform - An awesome list of Terraform resources (30 min)

Infrastructure

🐥 Using Redis In-Memory Storage for your Python Applications - How to use Redis as an in-memory cache for your Python application (30 min)
🐍 Python Kafka Consumers: at-least-once, at-most-once, exactly-once - How to write different types of Kafka consumers in Python (30 min)
🦄 Kafka Exactly-Once-Semantics - How to produce and consume messages exactly once (1 hour)
🦄 RabbitMQ: a message queue library with persistence - RabbitMQ is a messaging system with a message broker (4 hours)
🦄 ZeroMQ: a socket library with message queue primitives - ZeroMQ is a lightweight messaging system without a message broker (8 hours)

Curated by Radix

Radix is a Belgium-based Machine Learning company.

We invent, design and develop AI-powered software. Together with our clients, we identify which problems within organizations can be solved with AI, demonstrating the value of Artificial Intelligence for each problem.

Our team is constantly looking for novel and better-performing solutions and we challenge each other to come up with the best ideas for our clients and our company.

Here are some examples of what we do with Machine Learning, the technology behind AI:

Help job seekers find great jobs that match their expectations. On the Belgian Public Employment Service website, you can find our job recommendations based on your CV alone.
Help hospitals save time. We extract diagnoses from patient discharge letters.
Help publishers estimate their impact by detecting copycat articles.

We work hard and we have fun together. We foster a culture of collaboration, where each team member feels supported when taking on a challenge, and trusted when taking on responsibility.

giordano-lucas / awesome-machine-learning-engineer
