Making bare-metal machines managed and working together
We will walk you through the steps to build a production-grade GPU cluster for scientific computing, hosted in a data center and built from the ground up with open source tools.
We assume you have a server room or rented data center space with some empty racks, and that external network, electricity, and cooling are ready.
- For a group of AI researchers:
  - it should be easy to experiment with their ideas and to train and tune models
  - resources should be distributed fairly
  - data should be stored safely
  - performance (I/O, computing)
- For maintainers:
  - should be able to get an overview of the whole cluster
  - need to be notified when there is any hardware issue
  - should be able to resolve most issues remotely
  - automation
  - extensible
- For management:
  - low cost
  - no exceptions
- Capacity
- Speed
- Redundancy
- Cost
- Kubespray: a production-grade Kubernetes (k8s) deployment tool that works with Ansible
- Kubernetes Cluster Deployment on InfiniBand Fabric with RDMA Shared Device Plugin.
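
The capacity, redundancy, and cost criteria listed above pull against each other, and a little arithmetic makes the trade-off concrete. The sketch below assumes those criteria refer to storage and that redundancy is achieved by plain replication (keeping several full copies of the data); neither assumption comes from this guide. The function names and all numbers are hypothetical and only illustrate the calculation.

```python
def usable_capacity_tb(raw_tb: float, replicas: int) -> float:
    """Usable capacity when every object is stored `replicas` times.

    Assumes simple replication; erasure coding would give a different ratio.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return raw_tb / replicas


def cost_per_usable_tb(total_cost: float, raw_tb: float, replicas: int) -> float:
    """Hardware cost divided by the capacity that users can actually fill."""
    return total_cost / usable_capacity_tb(raw_tb, replicas)


if __name__ == "__main__":
    # Hypothetical figures: 300 TB of raw disk bought for 30000 currency units.
    raw_tb, total_cost = 300.0, 30000.0
    for replicas in (1, 2, 3):
        usable = usable_capacity_tb(raw_tb, replicas)
        cost = cost_per_usable_tb(total_cost, raw_tb, replicas)
        print(f"replicas={replicas}: usable={usable:.0f} TB, cost per usable TB={cost:.0f}")
```

Raising the replica count improves redundancy but divides usable capacity and multiplies the cost per usable terabyte by the same factor, which is why these criteria need to be weighed together rather than one at a time.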