NuthulaTarun / mlops

ML Ops: Machine Learning Operations

This repository contains code examples of MLOps practices.

Machine learning workflow with Airflow, MLflow and SageMaker

ML Workflow

Workflow tasks are scheduled by Airflow and experiments are logged in MLflow

  1. The business problem is framed as a machine learning problem: what is observed and what is predicted.
  2. Data acquisition: ingesting data from its sources, including data collection, data integration and data quality checks.
  3. Data pre-processing: handling missing data, outliers, long tails, etc.
  4. Feature engineering: running experiments with different features, adding, removing and changing features.
  5. Data transformation: standardizing data and converting it into a format compatible with the training algorithms.
  6. Model training: training parameters, metrics, etc. are tracked in MLflow. We can also run SageMaker Hyperparameter Optimization with many training jobs, then search the metrics and parameters in MLflow to compare the runs and find the best version of a model with minimal effort.
  7. Model evaluation: analyzing model performance based on predicted results on test data.
  8. If the business goals are met, the model is registered in SageMaker Inference Models. We can also register the model in MLflow.
  9. Getting predictions in any of the following ways:
    1. Using SageMaker Batch Transform to get predictions for an entire dataset.
    2. Setting up a persistent endpoint to get one prediction at a time using SageMaker Inference Endpoints.
  10. Monitoring and debugging the workflow, and re-training with data augmentation.
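
The ordering of steps 2 to 9 can be sketched as a dependency graph, which is the shape an Airflow DAG would take. The following is a minimal sketch using only the Python standard library, not Airflow itself; the task names and the `should_register` gate for step 8 are hypothetical labels chosen for illustration, not names from this repository.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Task names are hypothetical labels for steps 2-9 of the list above.
DEPENDENCIES = {
    "acquire_data": set(),
    "preprocess_data": {"acquire_data"},
    "engineer_features": {"preprocess_data"},
    "transform_data": {"engineer_features"},
    "train_model": {"transform_data"},
    "evaluate_model": {"train_model"},
    "register_model": {"evaluate_model"},
    # Both prediction modes only need a registered model.
    "batch_transform": {"register_model"},
    "deploy_endpoint": {"register_model"},
}


def execution_order(dependencies):
    """Return one valid run order in which every task follows its upstream tasks."""
    return list(TopologicalSorter(dependencies).static_order())


def should_register(metrics, thresholds):
    """Step 8 gate: register only if every required metric meets its minimum.

    A metric that is missing from `metrics` counts as a failure.
    """
    return all(
        metrics.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    )
```

In a real deployment each key would become an Airflow task and each dependency an upstream relationship; the gate would decide whether the registration task runs.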

For data processing, feature engineering and model evaluation, we can use several AWS services:

  • EMR: provides a Hadoop ecosystem cluster with Spark, Flink, etc. pre-installed. We should use a transient cluster to process the data and terminate it when all the work is done.
  • Glue jobs: provide serverless Apache Spark and Python environments. Glue has supported Spark 3.1 since August 2021.
  • SageMaker Processing jobs: run in containers, with many prebuilt images for data science. They also support Spark 3.
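
As an illustration of the transient EMR cluster mentioned above, the sketch below builds the keyword arguments for the boto3 EMR `run_job_flow` call. It only constructs the request and does not call AWS. The function name, instance types, instance count, release label and step name are assumptions chosen for the example, not values from this repository.

```python
def transient_spark_cluster_request(name, script_s3_uri, log_s3_uri):
    """Build keyword arguments for the boto3 EMR client's run_job_flow call."""
    return {
        "Name": name,
        # Assumed release; pick one that ships the Spark version you need.
        "ReleaseLabel": "emr-6.5.0",
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_s3_uri,
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",   # assumed instance type
            "InstanceCount": 3,                 # assumed cluster size
            # False makes the cluster transient: it shuts down once
            # there are no more steps to run.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "process-data",  # hypothetical step name
                # Also shut the cluster down if the step fails.
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
                },
            }
        ],
        # Default EMR role names; replace them if your account uses custom roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With boto3 installed and credentials configured, the request could be submitted as `boto3.client("emr").run_job_flow(**request)`.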

For the other steps, we can use Amazon SageMaker for model training, hyperparameter tuning, model serving and production monitoring.
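
For the model serving part, a batch prediction over an entire dataset can be requested through the SageMaker `create_transform_job` API. The sketch below only builds the request arguments and does not call AWS. The function name, content type and instance settings are assumptions for illustration.

```python
def batch_transform_request(job_name, model_name, input_s3_uri, output_s3_uri):
    """Build keyword arguments for the SageMaker create_transform_job call."""
    return {
        "TransformJobName": job_name,
        # The model must already be registered in SageMaker.
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",  # assumed input format
            "SplitType": "Line",        # treat each line as one record
        },
        "TransformOutput": {
            "S3OutputPath": output_s3_uri,
            "AssembleWith": "Line",     # write one prediction per line
        },
        "TransformResources": {
            "InstanceType": "ml.m5.xlarge",  # assumed instance type
            "InstanceCount": 1,
        },
    }
```

With boto3 available, this would be passed as `boto3.client("sagemaker").create_transform_job(**request)`.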

Airflow and MLflow store their metadata in Amazon RDS for PostgreSQL.
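
Both tools take a SQLAlchemy-style connection URI for their metadata store. The helper below is a minimal sketch of building one for a PostgreSQL instance; the function names, host, user and database names are hypothetical, and the `psycopg2` driver is an assumption. The Airflow setting name shown applies to Airflow 2.3 and later (older versions use the `core` section instead of `database`).

```python
from urllib.parse import quote_plus


def postgres_uri(user, password, host, database, port=5432):
    """Build a SQLAlchemy URI, escaping characters such as '@' or '/' in credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def metadata_store_settings(airflow_uri, mlflow_uri):
    """Show where each URI is consumed by the two tools."""
    return {
        # Environment variable read by Airflow 2.3+ for its metadata database.
        "airflow_env": {"AIRFLOW__DATABASE__SQL_ALCHEMY_CONN": airflow_uri},
        # Command line for an MLflow tracking server backed by the database.
        "mlflow_command": ["mlflow", "server", "--backend-store-uri", mlflow_uri],
    }
```

Keeping separate databases (or at least separate users) for Airflow and MLflow on the same instance is a common choice, though the original text does not say which layout this repository uses.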

Data access

  • All data stored in S3 can be queried via Athena with metadata from Glue data catalog.
  • We can also ingest the data into SageMaker Feature Store in batches directly to the offline store.
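
Querying S3 data through Athena can be sketched as follows. The function only builds the arguments for the Athena `start_query_execution` call and does not contact AWS; its name and the database, table and bucket names in the usage note are hypothetical.

```python
def athena_query_request(sql, database, output_s3_uri):
    """Build keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": sql,
        # The database is resolved through the Glue data catalog.
        "QueryExecutionContext": {"Database": database},
        # Athena writes the query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }
```

For example, `boto3.client("athena").start_query_execution(**athena_query_request("SELECT * FROM features LIMIT 10", "ml_db", "s3://my-bucket/athena-results/"))` would start the query and return a `QueryExecutionId`, which can then be polled with `get_query_execution`.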

Languages

Python 99.7%, Shell 0.3%