Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports Hyperparameter Tuning, Early Stopping and Neural Architecture Search.
Katib is agnostic to machine learning (ML) frameworks. It can tune hyperparameters of applications written in any language of the users’ choice and natively supports many ML frameworks, such as TensorFlow, MXNet, PyTorch, XGBoost, and others.
- Getting Started
- Name
- Concepts in Katib
- Components in Katib
- Web UI
- GRPC API documentation
- Installation
- Quick Start
- Community
- Contributing
- Citation
Follow the getting-started guide on the Kubeflow website.
Katib stands for `secretary` in Arabic.
For a detailed description of the concepts in Katib and AutoML, check the Kubeflow documentation.
Katib has the concepts of Experiment, Suggestion, Trial and Worker Job.
An Experiment represents a single optimization run over a feasible space. Each Experiment contains a configuration:
- Objective: What you want to optimize.
- Search Space: Constraints for configurations describing the feasible space.
- Search Algorithm: How to find the optimal configurations.
Katib Experiment is defined as a CRD. Check the detailed guide to configuring and running a Katib Experiment in the Kubeflow docs.
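To make the three parts of the configuration concrete, here is a minimal sketch that assembles an Experiment-shaped manifest as a plain Python dict. The field names follow our understanding of the v1beta1 API and should be checked against the API reference; the metric name, parameter range, Experiment name, and namespace are illustrative assumptions, and the `trialTemplate` section is deliberately omitted, so this is not a complete, submittable Experiment.

```python
import json


def build_experiment(name, namespace):
    """Return a dict shaped like a v1beta1 Experiment (trialTemplate omitted)."""
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Objective: what you want to optimize (illustrative metric name).
            "objective": {
                "type": "maximize",
                "goal": 0.99,
                "objectiveMetricName": "Validation-accuracy",
            },
            # Search Algorithm: how to find the optimal configurations.
            "algorithm": {"algorithmName": "random"},
            # Upper bound on the number of Trials, and how many run at once.
            "maxTrialCount": 12,
            "parallelTrialCount": 3,
            # Search Space: constraints describing the feasible space.
            "parameters": [
                {
                    "name": "lr",
                    "parameterType": "double",
                    "feasibleSpace": {"min": "0.01", "max": "0.03"},
                },
            ],
        },
    }


# "random-sketch" and "my-namespace" are hypothetical names.
experiment = build_experiment("random-sketch", "my-namespace")
print(json.dumps(experiment, indent=2))
```

A real Experiment also needs a Trial template describing the Worker Job; see the Trial template guide for that part.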
A Suggestion is a set of hyperparameter values that the hyperparameter tuning process has proposed. Katib creates a Trial to evaluate the suggested set of values.
Katib Suggestion is defined as a CRD.
A Trial is one iteration of the hyperparameter tuning process. A Trial corresponds to one worker job instance with a list of parameter assignments. The list of parameter assignments corresponds to a Suggestion.
Each Experiment runs several Trials. The Experiment runs the Trials until it reaches either the objective or the configured maximum number of Trials.
Katib Trial is defined as a CRD.
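The relationship between Experiment, Suggestion, and Trial can be sketched as a simple loop. This is a conceptual illustration in plain Python, not Katib's controller logic: the function names, the toy objective, and the use of uniform random sampling are all assumptions made for the example, and the Trials run sequentially here rather than in parallel on a cluster.

```python
import random


def run_experiment(objective_fn, search_space, goal, max_trial_count, seed=0):
    """Run Trials until the goal is reached or max_trial_count Trials have run."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_trial_count):
        # Suggestion: one set of hyperparameter values from the search space.
        suggestion = {
            name: rng.uniform(low, high)
            for name, (low, high) in search_space.items()
        }
        # Trial: evaluate the Suggestion (objective_fn stands in for the Worker Job).
        value = objective_fn(**suggestion)
        trials.append((suggestion, value))
        # Stop early once the objective is reached (a "maximize" objective).
        if value >= goal:
            break
    best = max(trials, key=lambda trial: trial[1])
    return best, trials


def toy_objective(lr):
    # Hypothetical metric that peaks at lr = 0.02.
    return 1.0 - abs(lr - 0.02) * 10


best, trials = run_experiment(
    toy_objective, {"lr": (0.01, 0.03)}, goal=0.99, max_trial_count=12
)
print(f"ran {len(trials)} trials; best value {best[1]:.3f} with {best[0]}")
```

The loop ends on whichever condition comes first, mirroring the two stopping conditions described above.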
The Worker Job is the process that runs to evaluate a Trial and calculate its objective value.
The Worker Job can be any type of Kubernetes resource or Kubernetes CRD. Follow the Trial template guide to support your own Kubernetes resource in Katib.
Katib provides upstream examples for several such CRDs.
Thus, Katib supports multiple frameworks with the help of different job kinds.
Katib currently supports several search algorithms. See the Kubeflow documentation to learn more about each algorithm.
- Random Search
- Tree of Parzen Estimators (TPE)
- Grid Search
- Hyperband
- Bayesian Optimization
- Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
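As a rough illustration of how two of these strategies differ, the sketch below generates Suggestions by grid search (every combination of listed values) and by random search (independent uniform draws). It is a simplified stand-in written for this README, not the code of Katib's algorithm services; the parameter names and values are made up.

```python
import itertools
import random


def grid_suggestions(grid):
    """Yield every combination of the listed values, one dict per Suggestion."""
    names = list(grid)
    for combo in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


def random_suggestions(space, count, seed=0):
    """Yield `count` Suggestions drawn uniformly from continuous ranges."""
    rng = random.Random(seed)
    for _ in range(count):
        yield {name: rng.uniform(low, high) for name, (low, high) in space.items()}


# Hypothetical search spaces.
grid = {"lr": [0.01, 0.02, 0.03], "num_layers": [2, 3]}
space = {"lr": (0.01, 0.03)}

grid_points = list(grid_suggestions(grid))
random_points = list(random_suggestions(space, count=4))

print(f"grid search proposes {len(grid_points)} suggestions")
print(f"random search proposes {len(random_points)} suggestions")
```

Grid search grows multiplicatively with the number of parameters and values, while random search lets you fix the number of Trials up front; the other listed algorithms use results from earlier Trials to choose later Suggestions.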
Katib consists of several components, as shown below. Each component runs on Kubernetes as a deployment. The components communicate with each other via gRPC, and the API is defined at `pkg/apis/manager/v1beta1/api.proto`.
- Katib main components:
  - `katib-db-manager` - the gRPC API server of Katib, which is the DB interface.
  - `katib-mysql` - the data storage backend of Katib, using MySQL.
  - `katib-ui` - the user interface of Katib.
  - `katib-controller` - the controller for the Katib CRDs in Kubernetes.
Katib provides a Web UI. You can visualize the general trend of the hyperparameter space and the training history of each run. You can use random-example or other examples to generate a similar UI. Follow the Kubeflow documentation to access the Katib UI.
During the 1.3 release cycle, we worked on a new iteration of the UI, which is rewritten in Angular and uses the common code of the other Kubeflow dashboards. While this UI is not yet on par with the current default one, we are actively working to bring it up to speed and provide all the existing functionality.
Users can currently list, delete, and create Experiments in their cluster via this new UI, as well as inspect the owned Trials. One important missing feature is the ability to edit the TrialTemplate ConfigMaps.
While this UI is not ready to replace the current one, we encourage users to give it a try and provide us with feedback. To try it out, update the Katib UI image `newName` with the new registry `docker.io/kubeflowkatib/katib-new-ui` in the Kustomize manifests.
Check the Katib v1beta1 API reference docs.
For standard installation of Katib with support for all job operators, install Kubeflow. Follow the documentation:
If you install Katib with other Kubeflow components, you can't submit Katib jobs in the Kubeflow namespace. Check the Kubeflow documentation to learn more about it.
Alternatively, if you want to install Katib manually with TF and PyTorch operator support, follow these steps:
Create Kubeflow namespace:
kubectl create namespace kubeflow
Clone Kubeflow manifest repository:
git clone -b v1.2-branch git@github.com:kubeflow/manifests.git
Set `MANIFESTS_DIR` to the cloned folder.
export MANIFESTS_DIR=<cloned-folder>
For installing TF operator, run the following:
cd "${MANIFESTS_DIR}/tf-training/tf-job-crds/base"
kustomize build . | kubectl apply -f -
cd "${MANIFESTS_DIR}/tf-training/tf-job-operator/base"
kustomize build . | kubectl apply -f -
For installing PyTorch operator, run the following:
cd "${MANIFESTS_DIR}/pytorch-job/pytorch-job-crds/base"
kustomize build . | kubectl apply -f -
cd "${MANIFESTS_DIR}/pytorch-job/pytorch-operator/base/"
kustomize build . | kubectl apply -f -
Note that your kustomize version should be >= 3.2. To install Katib run:
git clone git@github.com:kubeflow/katib.git
cd katib
make deploy
Check if all components are running successfully:
kubectl get pods -n kubeflow
Expected output:
NAME READY STATUS RESTARTS AGE
katib-controller-858d6cc48c-df9jc 1/1 Running 1 20m
katib-db-manager-7966fbdf9b-w2tn8 1/1 Running 0 20m
katib-mysql-7f8bc6956f-898f9 1/1 Running 0 20m
katib-ui-7cf9f967bf-nm72p 1/1 Running 0 20m
pytorch-operator-55f966b548-9gq9v 1/1 Running 0 20m
tf-job-operator-796b4747d8-4fh82 1/1 Running 0 21m
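If you want to script this check, one option is to parse the `kubectl get pods` text and flag any pod that is not fully ready. The helper below is a hypothetical sketch that relies only on the first three whitespace-separated columns (NAME, READY, STATUS) shown in the output above; the sample input is made up to include one failing pod.

```python
def pods_not_ready(kubectl_output):
    """Return names of pods that are not Running or not fully ready."""
    problems = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        name, ready, status = fields[0], fields[1], fields[2]
        ready_now, _, ready_total = ready.partition("/")
        if status != "Running" or ready_now != ready_total:
            problems.append(name)
    return problems


# Made-up sample in the same layout as the expected output.
sample = """\
NAME READY STATUS RESTARTS AGE
katib-controller-858d6cc48c-df9jc 1/1 Running 1 20m
katib-mysql-7f8bc6956f-898f9 0/1 Pending 0 20m
"""

print(pods_not_ready(sample))  # → ['katib-mysql-7f8bc6956f-898f9']
```

Note that this simple rule also flags pods in other states, such as `Completed`, so adapt it if your namespace contains finished jobs.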
After deploying everything, you can run examples to verify the installation.
This is an example for TF operator:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/tfjob-example.yaml
This is an example for PyTorch operator:
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/pytorchjob-example.yaml
Check the Kubeflow documentation to learn how to monitor your Experiment status.
You can view your results in the Katib UI.
If you used the standard installation, access the Katib UI via the Kubeflow dashboard.
Otherwise, port-forward the `katib-ui` service:
kubectl -n kubeflow port-forward svc/katib-ui 8080:80
You can access the Katib UI using this URL: http://localhost:8080/katib/.
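To confirm from a script that the port-forward is working, you could probe the URL with Python's standard library. This is a small sketch under the assumption that the UI answers a plain GET on that path with HTTP 200; the helper names are hypothetical, and a `False` result only means the request did not succeed, not why.

```python
import http.client
import urllib.request


def katib_ui_url(local_port=8080):
    """Build the local Katib UI URL for a given forwarded port."""
    return f"http://localhost:{local_port}/katib/"


def ui_is_reachable(url, timeout=3.0):
    """Return True if a GET on the URL succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error responses.
        return False


url = katib_ui_url()
print(f"{url} reachable: {ui_is_reachable(url)}")
```

Run it while the `kubectl port-forward` command is active in another terminal.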
Katib provides a Python SDK:
- Check the Katib v1beta1 SDK documentation.
Run `make generate` to update the Katib SDK.
To delete the installed TF and PyTorch operators, run `kubectl delete -f` on the respective folders.
To delete Katib, run `make undeploy`.
Please follow the Kubeflow documentation to submit your first Katib experiment.
We are always growing our community and invite new users and AutoML enthusiasts to contribute to the Katib project. The following links provide information about getting involved in the community:
- If you use Katib, please update the adopters list.
- Subscribe to the calendar to attend the AutoML WG community meeting.
- Check the AutoML WG meeting notes.
- Learn more about Katib in the presentations and demos list.
- Kubeflow Katib: Scalable, Portable and Cloud Native System for AutoML (by Andrey Velichkevich)
Please feel free to test the system! developer-guide.md is a good starting point for developers.
If you use Katib in a scientific publication, we would appreciate citations to the following paper:
A Scalable and Cloud-Native Hyperparameter Tuning System, George et al., arXiv:2006.02085, 2020.
Bibtex entry:
@misc{george2020katib,
title={A Scalable and Cloud-Native Hyperparameter Tuning System},
author={Johnu George and Ce Gao and Richard Liu and Hou Gang Liu and Yuan Tang and Ramdoot Pydipaty and Amit Kumar Saha},
year={2020},
eprint={2006.02085},
archivePrefix={arXiv},
primaryClass={cs.DC}
}