czangyeob/MLOps

MLOps

A repo for notes on machine learning operations automation (MLOps).

Model & Pipeline Versioning

Data Version Control (DVC) - A Git-like tool that works alongside Git to allow version management of data and models
ModelDB - Framework to track all the steps in your ML code to keep track of what version of your model obtained which accuracy, and then visualise it and query it via the UI
Pachyderm - Open source distributed processing framework built on Kubernetes, focused mainly on dynamic building of production machine learning pipelines - (Video)
steppy - Lightweight, Python3 library for fast and reproducible machine learning experimentation. Introduces simple interface that enables clean machine learning pipeline design.
Jupyter Notebooks - Web interface python sandbox environments for reproducible development
Quilt Data - Versioning, reproducibility and deployment of data and models.
H2O Flow - Jupyter notebook-like interface for H2O to create, save and re-use "flows"
ModelChimp - Framework to track and compare all the results and parameters from machine learning models (Video)
PredictionIO - An open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task
MLflow - Open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment.
Sacred - Tool to help you configure, organize, log and reproduce machine learning experiments.
FGLab - Machine learning dashboard, designed to make prototyping experiments easier.
Studio.ML - Model management framework which minimizes the overhead involved with scheduling, running, monitoring and managing artifacts of your machine learning experiments.
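
Many of the tools above tie a model artifact to the parameters and metrics that produced it, so that "which version obtained which accuracy" can be queried later. The following is only a rough standard-library sketch of that idea, not the API of any tool listed here. The names `register_model`, `best_version` and `registry`, and the parameter and metric values, are hypothetical placeholders.

```python
import hashlib
import json


def register_model(registry, artifact_bytes, params, metrics):
    """Record an artifact under a version ID derived from its content and params."""
    digest = hashlib.sha256()
    digest.update(artifact_bytes)
    # sort_keys keeps the serialized params stable regardless of dict order
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    version_id = digest.hexdigest()[:12]
    registry[version_id] = {"params": params, "metrics": metrics}
    return version_id


def best_version(registry, metric):
    """Return the version ID with the highest value for the given metric."""
    return max(registry, key=lambda v: registry[v]["metrics"][metric])


# Placeholder artifacts and numbers, for illustration only
registry = {}
v1 = register_model(registry, b"weights-a", {"lr": 0.1}, {"accuracy": 0.81})
v2 = register_model(registry, b"weights-b", {"lr": 0.01}, {"accuracy": 0.87})
```

Because the ID is derived from the artifact bytes and parameters, registering the same inputs again yields the same ID, which is the property that makes a run reproducible to look up. Real tools add storage backends, UIs and lineage on top of this.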

Data Storage / Standardisation / Privacy

EdgeDB - NoSQL interface for Postgres that allows for object interaction to data stored
BayesDB - Database that allows for built-in non-parametric Bayesian model discovery and querying for data on a database-like interface - (Video)
Apache Arrow - In-memory columnar representation of data compatible with Pandas, Hadoop-based systems, etc
Apache Parquet - On-disk columnar representation of data compatible with Pandas, Hadoop-based systems, etc
Apache Kafka - Distributed streaming platform framework
Uber SQL Differential Privacy - Uber's open source framework that enforces differential privacy for general-purpose SQL queries.
ClickHouse - ClickHouse is an open source column oriented database management system supported by Yandex - (Video)

Feature Engineering Automation

auto-sklearn - Framework to automate algorithm and hyperparameter tuning for sklearn
TPOT - Automation of sklearn pipeline creation (including feature selection, pre-processor, etc)
tsfresh - Automatic extraction of relevant features from time series
Featuretools - An open source framework for automated feature engineering
Colombus - A scalable framework to perform exploratory feature selection implemented in R
automl - Automated feature engineering, feature/model selection, hyperparam. optimisation

Model Deployment Frameworks

Seldon - Open source platform for deploying and monitoring machine learning models in Kubernetes - (Video)
Redis-ML - Module available from unstable branch that supports a subset of ML models as Redis data types
Model Server for Apache MXNet (MMS) - A model server for Apache MXNet from Amazon Web Services that is able to run MXNet models as well as Gluon models (Amazon's SageMaker runs a custom version of MMS under the hood)
TensorFlow Serving - High-performance framework to serve TensorFlow models via the gRPC protocol, able to handle 100k requests per second per core
Clipper - Model server project from Berkeley's RISE Lab which includes a standard RESTful API and supports TensorFlow, Scikit-learn and Caffe models
DeepDetect - Machine Learning production server for TensorFlow, XGBoost and Caffe models written in C++ and maintained by Jolibrain
MLeap - Standardisation of pipeline and model serialization for Spark, Tensorflow and sklearn
OpenScoring - REST web service for scoring PMML models built and maintained by OpenScoring.io
NVIDIA TensorRT - Model server created by NVIDIA that runs models in ONNX format, including frameworks such as TensorFlow and MATLAB

Data Pipeline Frameworks

Apache Airflow - Data Pipeline framework built in Python, including scheduler, DAG definition and a UI for visualisation
Luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs, handling dependency resolution, workflow management, visualisation, etc
Genie - Job orchestration engine to interface and trigger the execution of jobs from Hadoop-based systems
Oozie - Workflow scheduler for Hadoop jobs
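
The frameworks above share one core idea: a pipeline is a directed acyclic graph (DAG) of tasks, and the scheduler resolves dependencies so each task runs only after its upstream tasks. As a minimal sketch of that idea, using Python's standard-library `graphlib` (Python 3.9+) rather than the API of Airflow, Luigi or any other listed tool, with hypothetical `run_pipeline` and task names:

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task once, after all the tasks it depends on.

    tasks: mapping of task name -> zero-argument callable
    dependencies: mapping of task name -> set of upstream task names
    """
    # static_order() yields task names with every upstream task first;
    # it raises graphlib.CycleError if the graph contains a cycle
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


# Hypothetical three-step pipeline; each task just records that it ran
log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
dependencies = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
order = run_pipeline(tasks, dependencies)
```

Production frameworks layer scheduling, retries, distributed execution and a UI on top of this ordering step.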

Infrastructure Orchestration Frameworks

Kubeflow - A cloud native platform for machine learning based on Google’s internal machine learning pipelines.
Polyaxon - A platform for reproducible and scalable machine learning and deep learning on Kubernetes. - (Video)

Optimization of Computation

Numba - A compiler for Python array and numerical functions

Reference

awesome-machine-learning-operations
