matty1979 / aws-spot-instances-kubeflow

Config files for setting up Multitenant Kubeflow on AWS with spot instances

Home Page:https://blog.gofynd.com/how-we-reduced-our-ml-training-costs-by-78-a33805cb00cf

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Kubeflow on Spot instances

Kubeflow Logo

Kubernetes GitHub issues GitHub forks GitHub Stars GitHub License

Config files for setting up Multitenant Kubeflow on AWS with spot instances Repo contains supporting code for How we reduced our ML training costs by 78%

By the end of this tutorial, you will have:

  • An EKS cluster with Kubernetes 1.14 on AWS
  • Autoscaling with Nodegroup autodiscovery enabled
  • GPU nodes
    • With scale-down-to-zero at no workload
    • Spot Instance purchase enabled by default
  • Kubeflow 1.0.1 running on the cluster with only GPU requesting resources running on GPU nodes

TLDR;

# setup environment
export ENVIRONMENT=staging
export AWS_PROFILE=<your profile>
source envs/$ENVIRONMENT/variables.sh
# Create cluster
eksctl create cluster -f envs/$ENVIRONMENT/cluster-spec.yml
kubectl cluster-info # to check if the cluster is connected
# set executable
chmod a+x *.sh
# Deploy Kubeflow
./deploy_kubeflow.sh

Prequisites

AWS

CLI

Cluster Spec

The cluster that gets spun up will have the following specs:

  • ng-1
    • m5a.2xlarge
    • min nodes: 0
    • max nodes: 3
    • vol: 100 GB
  • ng-2
    • m5a.2xlarge
    • min: 0
    • max: 10
    • vol: 20 GB
  • 1-gpu-spot-p2-xlarge
    • p2.xlarge
    • min nodes: 0
    • max nodes: 10
    • max price: $1.2
  • 1-gpu-spot-p3-2xlarge
    • p3.2xlarge
    • min nodes: 0
    • max nodes: 10
    • max price: $1.2
  • 4-gpu-spot-p3-8xlarge
    • p3.8xlarge
    • min nodes: 0
    • max nodes: 4
    • max price: OnDemand
  • 8-gpu-spot-p3dn-24xlarge -- Disabled by default
    • p3dn.24xlarge
    • min nodes: 0
    • max nodes: 1
    • max price: $11

About

Config files for setting up Multitenant Kubeflow on AWS with spot instances

https://blog.gofynd.com/how-we-reduced-our-ml-training-costs-by-78-a33805cb00cf

License:MIT License


Languages

Language:Shell 100.0%