Mr-Linus/SCV

SCV

SCV is a distributed GPU sniffer for Kubernetes clusters. It works with Yoda-Scheduler to enable fine-grained GPU scheduling.

GPU metrics that SCV can monitor

Core Frequency
Model
Free Memory
Total Memory
Memory Frequency
Bandwidth
Power
GPU Number

CRD Example

apiVersion: core.run-linux.com/v1
kind: Scv
metadata:
  creationTimestamp: "2020-09-01T06:45:19Z"
  generation: 4
  name: isl-super
  resourceVersion: "88823392"
  selfLink: /apis/core.run-linux.com/v1/scvs/isl-super
  uid: 0fe4de13-34ab-44fc-9454-78a50407c4ad
spec:
  updateInterval: 1000
status:
  cardList:
  - bandwidth: 15760
    clock: 5705
    core: 1911
    freeMemory: 12194
    health: Healthy
    id: 0
    model: TITAN Xp
    power: 250
    totalMemory: 12194
  cardNumber: 1
  freeMemorySum: 12194
  totalMemorySum: 12194
  updateTime: "2020-09-05T11:47:48Z"
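
As a rough illustration of how a consumer such as a scheduler might read this status, the Python sketch below filters `cardList` for healthy cards with enough free memory. The function name `usable_cards` and the `min_free_memory` threshold are hypothetical and are not part of SCV or Yoda-Scheduler; the dict simply mirrors the example above, and memory values are compared in whatever unit the status reports.

```python
# Status block copied from the CRD example above, as a plain dict.
EXAMPLE_STATUS = {
    "cardList": [
        {
            "bandwidth": 15760,
            "clock": 5705,
            "core": 1911,
            "freeMemory": 12194,
            "health": "Healthy",
            "id": 0,
            "model": "TITAN Xp",
            "power": 250,
            "totalMemory": 12194,
        }
    ],
    "cardNumber": 1,
    "freeMemorySum": 12194,
    "totalMemorySum": 12194,
}


def usable_cards(status, min_free_memory):
    """Return the healthy cards whose freeMemory is at least min_free_memory.

    Hypothetical helper: `status` is the `status` section of an Scv object,
    already parsed into a dict.
    """
    return [
        card
        for card in status.get("cardList", [])
        if card.get("health") == "Healthy"
        and card.get("freeMemory", 0) >= min_free_memory
    ]


# Example: which cards on this node could host a job needing 8000 units of memory?
candidates = usable_cards(EXAMPLE_STATUS, 8000)
print([card["model"] for card in candidates])
```

A node-level pre-check against `freeMemorySum` would be cheaper, but checking each card avoids picking a node whose free memory is spread across several GPUs.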

Get Started

Ensure that the NVIDIA container runtime and the NVIDIA driver are installed on each Kubernetes worker node. See nvidia-docker for more details.

Ubuntu

# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
     
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit nvidia-container-runtime
$ sudo systemctl restart docker

CentOS

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
    
$ sudo yum install -y nvidia-container-toolkit nvidia-container-runtime
$ sudo systemctl restart docker

Set nvidia-container-runtime as the default Docker runtime on each Kubernetes worker node.

Edit /etc/docker/daemon.json on each worker node so that it contains the following, then restart Docker for the change to take effect:

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
          "max-size": "100m"
        },
        "storage-driver": "overlay2",
        "registry-mirrors": ["https://registry.docker-cn.com"]
    }
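
A quick sanity check on the edited file can catch mistakes before Docker is restarted. The Python sketch below is a minimal example under stated assumptions: the function name `check_nvidia_default_runtime` is hypothetical, and it only verifies the two settings that select the NVIDIA runtime, not the logging, storage, or mirror options.

```python
import json


def check_nvidia_default_runtime(daemon_json_text):
    """Return a list of problems found in a daemon.json document.

    Hypothetical helper: an empty list means "default-runtime" is "nvidia"
    and an "nvidia" runtime with a "path" is registered under "runtimes".
    Raises json.JSONDecodeError if the text is not valid JSON.
    """
    config = json.loads(daemon_json_text)
    problems = []

    if config.get("default-runtime") != "nvidia":
        problems.append('"default-runtime" is not set to "nvidia"')

    runtime = config.get("runtimes", {}).get("nvidia")
    if runtime is None:
        problems.append('no "nvidia" entry under "runtimes"')
    elif not runtime.get("path"):
        problems.append('the "nvidia" runtime has no "path"')

    return problems
```

To use it on a worker node, read /etc/docker/daemon.json into a string and pass it to the function; any returned messages point at the setting to fix.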

Deploy SCV into your Kubernetes cluster:

kubectl apply -f https://raw.githubusercontent.com/NJUPT-ISL/SCV/release-2.0/config/crd/bases/core.run-linux.com_scvs.yaml
kubectl apply -f https://raw.githubusercontent.com/NJUPT-ISL/SCV/master/deploy/deploy.yaml
