This example demonstrates simple end-to-end training and deployment of a Keras ResNet model on the CIFAR10 dataset, using the following technologies:
- NVIDIA-Docker2 to make the Docker containers GPU aware.
- NVIDIA device plugin to allow Kubernetes to access GPU nodes.
- TensorFlow-19.03 containers from the NVIDIA GPU Cloud container registry.
- TensorRT to optimize the inference graph in FP16, so that inference can run on Tensor Cores.
- TensorRT Inference Server for serving the trained model.
System requirements:
- Ubuntu 16.04 or later
- NVIDIA GPU
- Install NVIDIA Docker, Kubernetes and Kubeflow on your local machine (on your first run):
sudo ./install_kubeflow_and_dependencies.sh
- Build the Docker image of each pipeline component and compile the Kubeflow pipeline:
- First, make sure the IMAGE variable in build.sh in each component dir under the components dir points to a public container registry
- Then, make sure the image used in each ContainerOp in pipeline/src/pipeline.py matches IMAGE in the step above
- Then, make sure the image of the webapp Deployment in components/webapp_launcher/src/webapp-service-template.yaml matches IMAGE in components/webapp/build.sh
- Then, run:
sudo ./build_pipeline.sh
- Note the pipeline.py.tar.gz file that appears in your working directory
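The image checks in the build step are easy to get wrong by hand, so a small script can help. The following is a minimal sketch, not part of the repository. It assumes that each build.sh assigns IMAGE on a single line, that pipeline.py passes the image as a quoted `image=` keyword argument, and that the webapp template uses a plain `image:` key. The helper names (`build_image`, `referenced_images`, `missing_images`, `check_repo`) are hypothetical.

```python
import re
from pathlib import Path

# Assumed file shapes (not confirmed by the repository):
#   build.sh     ->  IMAGE=registry/name:tag   (optionally quoted or exported)
#   pipeline.py  ->  image='registry/name:tag' passed as a keyword argument
#   *.yaml       ->  image: registry/name:tag
IMAGE_VAR = re.compile(r'^\s*(?:export\s+)?IMAGE=["\']?([^"\'\s]+)', re.MULTILINE)
IMAGE_KWARG = re.compile(r'\bimage\s*=\s*["\']([^"\']+)["\']')
IMAGE_YAML = re.compile(r'^\s*-?\s*image:\s*["\']?([^"\'\s]+)', re.MULTILINE)


def build_image(build_sh_text):
    """Return the value assigned to IMAGE in a build.sh, or None if absent."""
    match = IMAGE_VAR.search(build_sh_text)
    return match.group(1) if match else None


def referenced_images(text):
    """Return every image named via image='...' or image: ... in the text."""
    return set(IMAGE_KWARG.findall(text)) | set(IMAGE_YAML.findall(text))


def missing_images(build_sh_texts, reference_texts):
    """Return built images that none of the reference files mention."""
    used = set()
    for text in reference_texts:
        used |= referenced_images(text)
    built = {build_image(text) for text in build_sh_texts}
    return sorted(img for img in built if img is not None and img not in used)


def check_repo(root="."):
    """Run the check against the paths named in the steps above."""
    root = Path(root)
    builds = [p.read_text() for p in sorted(root.glob("components/*/build.sh"))]
    references = [
        (root / "pipeline/src/pipeline.py").read_text(),
        (root / "components/webapp_launcher/src/webapp-service-template.yaml").read_text(),
    ]
    return missing_images(builds, references)
```

Calling `check_repo()` from the repository root returns the images that are built but never referenced; an empty list means the names agree. Images passed through a Python variable rather than a quoted string are not detected by this sketch.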
- Determine the ambassador port:
sudo kubectl get svc -n kubeflow ambassador
- Open the Kubeflow UI on:
- https://[local-machine-ip-address]:[ambassador-port]/
- E.g. https://10.110.210.99:31342/
- Click on the Pipeline Dashboard tab, upload the pipeline.py.tar.gz file you just compiled, and create a run
- Training takes about 20 minutes for 50 epochs, and a web UI is deployed as part of the pipeline so that users can interact with the served model
- Access the client web UI:
- https://[local-machine-ip-address]:[kubeflow-ambassador-port]/[webapp-prefix]/
- E.g. https://10.110.210.99:31342/webapp/
- Now you can test the trained model with random images and obtain the class prediction and probability distribution
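The two URLs in the steps above follow the same pattern, so they can be assembled from the ambassador service description. The following is a minimal sketch. It assumes the ambassador Service is of type NodePort and that its first port entry is the one to use. The helper names (`node_port`, `ui_url`, `ambassador_service`) are hypothetical.

```python
import json
import subprocess


def node_port(service, index=0):
    """Return the nodePort of one port entry in a Service object (a dict)."""
    return service["spec"]["ports"][index]["nodePort"]


def ui_url(host, port, prefix=""):
    """Build https://host:port/ or, with a prefix, https://host:port/prefix/."""
    path = prefix.strip("/")
    suffix = path + "/" if path else ""
    return f"https://{host}:{port}/{suffix}"


def ambassador_service():
    """Fetch the ambassador Service as a dict.

    Needs kubectl and a running cluster; prepend sudo if your setup requires it.
    """
    result = subprocess.run(
        ["kubectl", "get", "svc", "-n", "kubeflow", "ambassador", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Example with the values used in this document (no cluster needed):
example = {"spec": {"type": "NodePort", "ports": [{"port": 80, "nodePort": 31342}]}}
print(ui_url("10.110.210.99", node_port(example)))            # https://10.110.210.99:31342/
print(ui_url("10.110.210.99", node_port(example), "webapp"))  # https://10.110.210.99:31342/webapp/
```

On a live cluster, `node_port(ambassador_service())` replaces the hard-coded example dict.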
The following optional scripts clean up your cluster (useful for debugging):
- Delete deployments & services from previous runs:
sudo ./clean_utils/delete_all_previous_resources.sh
- Uninstall Minikube and Kubeflow:
sudo ./clean_utils/remove_minikube_and_kubeflow.sh