pdube / kubernetes-cloud

Getting Started with the CoreWeave Kubernetes GPU Cloud

Home Page: http://www.coreweave.com

CoreWeave Kubernetes Cloud

Introduction

The Kubernetes environment provides a flexible and reliable way to deploy workloads and services on CoreWeave's Accelerated Compute Cloud.

Deployment examples

Please see the folders in this repository for ready-to-deploy Kubernetes manifest examples.

Node Labels

Selecting the right hardware for your workload is important. All compute nodes are tagged with a set of labels specifying the hardware available in the node. Use affinity rules on your workloads to ensure that the desired type of hardware (e.g. GPU model) is assigned to the Pod. The following labels are currently available.

| Label | Possible Values | Description |
| --- | --- | --- |
| cpu.coreweave.cloud/family | i9, i7, i5, celeron, xeon, epyc | The CPU family of the CPU in the node |
| ethernet.coreweave.cloud/speed | 1G, 10G | The uplink speed from the node to the backbone |
| gpu.nvidia.com/count | 4-8 | Number of GPUs provisioned in the node. Using this selector is not recommended; GPU resource requests are the correct way to specify the number of GPUs required |
| gpu.nvidia.com/class | Tesla_V100 (see GPU Availability) | GPU model provisioned in the node |
| gpu.nvidia.com/vram | 8, 16 | GPU VRAM in gigabytes on the GPUs provisioned in the node |
| gpu.nvidia.com/nvlink | true, false | Denotes whether GPUs are interconnected with NVLink |
| pci.coreweave.cloud/version | 1, 2, 3, 4 | PCI Express version for GPU interfaces |
| pci.coreweave.cloud/speed | 2.5, 5, 8, 16 | PCI Express link speed for GPU interfaces in GT/s |
| pci.coreweave.cloud/lanes | 1, 4, 16 | PCI Express lanes (bus width) for GPU interfaces |
| topology.kubernetes.io/region | ORD1, EWR1, EWR2, BUF1 | The region the node is placed in |
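
As a rough illustration of the affinity rules described above, the following sketch writes a Pod manifest that requires a node whose gpu.nvidia.com/class label is Tesla_V100. The Pod name, file name, container image and the nvidia.com/gpu resource name are assumptions made for illustration. The resource name is the one commonly exposed by the NVIDIA device plugin and is not confirmed by this document for this cluster.

```shell
# Write a Pod manifest that restricts scheduling to Tesla_V100 nodes
# using a required node affinity rule on the gpu.nvidia.com/class label.
# Pod name, image and the nvidia.com/gpu resource name are illustrative assumptions.
cat > gpu-affinity-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-affinity-example
spec:
  restartPolicy: Never
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu.nvidia.com/class
            operator: In
            values:
            - Tesla_V100
  containers:
  - name: main
    image: busybox              # placeholder image
    command: ["sleep", "3600"]
    resources:
      limits:
        nvidia.com/gpu: 1       # assumed GPU resource name
EOF
echo "wrote gpu-affinity-pod.yaml"
```

Once cluster access is set up, the manifest could be submitted with `kubectl apply -f gpu-affinity-pod.yaml`. Adding more entries under `values` would let the Pod land on any of several GPU classes.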

GPU Availability

| Vendor | Class | Generation | CUDA Cores | VRAM | Label |
| --- | --- | --- | --- | --- | --- |
| NVIDIA | Tesla V100 NVLINK | Volta | 5,120 | 16 GB | Tesla_V100_NVLINK |
| NVIDIA | Tesla V100 | Volta | 5,120 | 16 GB | Tesla_V100 |
| NVIDIA | Multi Purpose Turing | Turing | 2,000+ | 8+ GB | NV_Turing |
| NVIDIA | Tesla P100 | Pascal | 3,584 | 16 GB | Tesla_P100_NVLINK |
| NVIDIA | Multi Purpose Pascal | Pascal | 2,000+ | 8 GB | NV_Pascal |

System Resources

Each GPU includes a certain amount of host CPU and RAM at no additional fee.

| Class | vCPU | RAM | Great For |
| --- | --- | --- | --- |
| Tesla V100 NVLINK | 4 Xeon Gold | 32 GB | Deep learning, neural network training, HPC |
| Tesla V100 | 3 | 16 GB | AI inference, rendering, batch processing, hashcat |
| Multi Purpose Turing | 3 | 16 GB | Machine learning, rendering, batch processing |
| Tesla P100 | 6 | 32 GB | Entry level HPC, rendering, batch processing |
| Multi Purpose Pascal | 1 | 8 GB | Video transcoding, rendering, batch processing |

A workload requesting more resources than allowed for the specific GPU class will have its resources capped to the maximum allowable amount.

For example, launching a Pod with a request for Multi Purpose Pascal GPUs will have its resource request capped to 2 CPU and 16 GB RAM.
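
The capping rule can be sketched as simple arithmetic. The helper below is hypothetical and is not part of any CoreWeave tooling. It assumes the cap equals the per-GPU allowance from the System Resources table multiplied by the number of GPUs requested, which this document does not state explicitly. Under that assumption, a request for two Multi Purpose Pascal GPUs is capped at 2 vCPU and 16 GB.

```shell
# Per-GPU allowance as "vCPU RAM_GB", taken from the System Resources table
# and keyed by the labels in the GPU Availability table.
allowance() {
  case "$1" in
    Tesla_V100_NVLINK) echo "4 32" ;;
    Tesla_V100)        echo "3 16" ;;
    NV_Turing)         echo "3 16" ;;
    Tesla_P100_NVLINK) echo "6 32" ;;
    NV_Pascal)         echo "1 8" ;;
    *) return 1 ;;
  esac
}

# capped_request CLASS GPUS CPU_REQUEST RAM_GB_REQUEST
# Prints "CPU RAM_GB" after applying the cap.
# ASSUMPTION: cap = per-GPU allowance multiplied by the GPU count.
capped_request() {
  class="$1"; gpus="$2"; cpu="$3"; ram="$4"
  per_gpu="$(allowance "$class")" || { echo "unknown class: $class" >&2; return 1; }
  max_cpu=$(( $gpus * ${per_gpu% *} ))   # first field: vCPU per GPU
  max_ram=$(( $gpus * ${per_gpu#* } ))   # second field: RAM in GB per GPU
  if [ "$cpu" -gt "$max_cpu" ]; then cpu="$max_cpu"; fi
  if [ "$ram" -gt "$max_ram" ]; then ram="$max_ram"; fi
  echo "$cpu $ram"
}

capped_request NV_Pascal 2 8 64    # prints: 2 16
capped_request Tesla_V100 1 2 8    # prints: 2 8 (already within the allowance)
```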

Getting Started

Install Kubernetes Command Line Tools

Cut-and-paste instructions are below. For more detail, please refer to the official documentation.

Mac OS

brew install kubectl

Linux

curl -LO https://storage.googleapis.com/kubernetes-release/release/`curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt`/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
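
After either install path, a quick check that the binary is reachable can save confusion later. The sketch below wraps the check in a hypothetical helper named check_kubectl; `kubectl version --client` prints only the client version and does not contact a cluster.

```shell
# Returns 0 and prints the client version if kubectl is on PATH,
# otherwise prints a message to stderr and returns 1.
check_kubectl() {
  if command -v kubectl >/dev/null 2>&1; then
    kubectl version --client
  else
    echo "kubectl not found on PATH" >&2
    return 1
  fi
}

check_kubectl || echo "install kubectl before continuing"
```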

Set Up Access

You will have received a pre-populated kube-config file from CoreWeave as part of your onboarding package. The snippet below assumes that you have no other Kubernetes credentials stored on your system. If you do, open both files and copy the cluster, context, and user entries from the supplied kube-config file into your existing ~/.kube/config file.

Replace ~/Downloads with the path to the kube-config supplied by CoreWeave.

mkdir -p ~/.kube/
mv ~/Downloads/kube-config ~/.kube/config
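
If you already have a ~/.kube/config, one alternative to copying entries by hand is to let kubectl merge the two files. This is a sketch that swaps in kubectl's own merge (`KUBECONFIG` with a colon-separated file list plus `kubectl config view --flatten`) for the manual copy described above. The function name merge_kubeconfig is hypothetical. When both files define the same key, the first file listed wins, so your existing current context is preserved.

```shell
# merge_kubeconfig EXISTING SUPPLIED
# Backs up EXISTING, then replaces it with a flattened merge of both files.
merge_kubeconfig() {
  existing="$1"   # e.g. ~/.kube/config
  supplied="$2"   # e.g. ~/Downloads/kube-config
  cp "$existing" "$existing.bak" || return 1
  merged="$(mktemp)" || return 1
  # kubectl reads every file listed in KUBECONFIG and merges them;
  # --flatten inlines credentials so the result is self-contained.
  if ! KUBECONFIG="$existing:$supplied" kubectl config view --flatten > "$merged"; then
    rm -f "$merged"
    return 1
  fi
  mv "$merged" "$existing" && chmod 600 "$existing"
}

# Example usage (adjust the paths to your system):
#   merge_kubeconfig ~/.kube/config ~/Downloads/kube-config
#   kubectl config get-contexts
```

After merging, `kubectl config use-context <name>` switches to the context from the supplied file, using a name shown by `kubectl config get-contexts`.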

Verify Access

Since your new account will not have any resources yet, listing the secrets is a good way to confirm that you can communicate with the cluster.

$ kubectl get secret
NAME                           TYPE                                  DATA   AGE
default-token-frqgm            kubernetes.io/service-account-token   3      5d3h

Once access is verified, you can deploy the examples found in this repository.
