

Bright Cluster Manager Tutorial with VMware

NVIDIA Bright Cluster Manager instructions for installing on a VM and using NVIDIA GPUs.

Much of the information provided relies on NVIDIA Bright Cluster Manager documentation.

Introduction

NVIDIA Bright Cluster Manager offers fast deployment and end-to-end management for heterogeneous high-performance computing (HPC) and AI server clusters at the edge, in the data center, and in multi/hybrid-cloud environments. It automates provisioning and administration for clusters ranging in size from a couple of nodes to hundreds of thousands, supports CPU-based and NVIDIA GPU-accelerated systems, and enables orchestration with Kubernetes.

This repository includes instructions for installing and running Bright Cluster Manager on VMware vSphere VMs and using vGPUs.

Requirements

  • Bright Cluster Manager license.
  • vGPU license.

Head node installation

  1. Download Bright's ISO file from Bright's download page.
    • Architecture: x86_64/amd64.
    • Linux Base Distribution: Ubuntu 20.04.
    • Hardware Vendor: Generic / Other.
    • Additional Features: mark Include CUDA Packages.
      • Note: mark Include OFED and OPA Packages and Include NVIDIA DGX A100 software image if needed. This will create an additional software image for the DGX.
  2. Upload the ISO file to vSphere's datastore.
  3. Create a new VM with the following settings:
    • Name (optional): bright_head_node.
    • Guest OS: Linux - Ubuntu (64-bit).
    • Virtual Hardware:
      • CPU: >= 4 CPUs.
      • Memory: >= 16 GB.
      • Hard disk: >= 128 GB.
        • Note: external storage might be used.
      • Create two network adapters:
        • An external network.
        • An internal network.
      • CD/DVD drive: Datastore ISO File.
        • Select Bright's ISO file.
        • Mark the "Connected" and "Connect At Power On" checkboxes.
    • VM Options:
      • Boot Options: Firmware - EFI.
  4. Launch the VM and connect to it (recommended through the remote console).
  5. Follow Bright's Graphical installer and note the following:
    • Workload manager: None.
      • Note: A workload manager will be installed later, because Pyxis and Enroot will NOT be installed if Slurm is chosen at this stage.
    • Network topology: Type1.
    • Head node:
      • Hardware manufacturer: Other.
    • Networks:
      • externalnet:
        • Keep DHCP marked.
      • internalnet.
      • Note: make sure the correct networks are set.
    • Head node interfaces:
      • External network:
        • Network: externalnet.
        • IP address: DHCP.
      • Internal network:
        • Network: internalnet.
      • Note: make sure the correct networks are set.
    • Compute nodes interfaces:
      • Interface: BOOTIF.
      • Network: internalnet.
    • Additional software: mark the CUDA checkbox.
      • Note: mark the OFED checkbox if needed.
    • Complete the installation.
  6. After the installation is complete:
    • Choose to reboot the VM.
    • In the VM settings, unmark the Connected checkbox from the head node VM CD/DVD drive.
    • Restart the VM.

Head node post-installation

  1. Launch and SSH to the head node with the root username and the password chosen during installation.

  2. Confirm the node can reach the internet with ping www.google.com.

  3. Update the node with apt -y update, apt -y upgrade and apt -y autoremove.
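
    • For convenience, the three commands can be chained into one line (a minimal sketch; && stops at the first command that fails):

      apt -y update && apt -y upgrade && apt -y autoremove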

  4. Install Bright's license by running: request-license.

    • Note: entering valid personal details is optional.
    • Note: in case the cluster is in a dark-site, air-gapped environment:
      • Run request-license to generate a CSR (certificate signing request).
      • Transfer the CSR to Bright's licensing server to get a signed license.
      • Copy the license back to the cluster and install it using install-license.
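    • Optional: after installation, the license details can be displayed to confirm it was accepted (a minimal check; the fields shown depend on the Bright version):

      cmsh -c "main; licenseinfo"
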
  5. Optional: move the home directory to an external drive by either:

    • Editing the fsmounts by running:

      cmsh
      category use <category-name>
      fsmounts
      use /home
      set device <hostname/IP of the NAS>:</path/to/export>
      commit
      
    • Running cmha-setup.

      • Note: this option is only meant for HA and includes moving /cm/shared and /home paths to an external shared storage (NAS, DAS or DRBD).
  6. Optional: if needed, fix the compute nodes' DNS by running:

    cmsh
    category use <category-name>
    append nameservers <nameserver>
    commit
    quit
    
    • Note: the nameservers setting is initially empty, so any existing nameservers should also be added.
    • Note: order of nameservers is important.
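    • Optional: the current value can be checked before and after the change (a small sketch using the same category mode):

      cmsh
      category use <category-name>
      get nameservers
      quit
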
  7. The following changes should be made for each software image:

    1. View all images:

      cmsh
      softwareimage
      list
      
    2. Clone the relevant image to a new image:

      softwareimage
      clone <from-image> <image-name>
      commit
      
      • Note: wait for Initial ramdisk for image <image-name> was generated successfully message to appear.
    3. Clone the default category to a new category and assign the relevant image:

      category
      clone default <category-name>
      set softwareimage <image-name>
      commit
      
    4. Assign the relevant nodes to the relevant category:

      device
      set <node-name> category <category-name>
      commit
      quit
      
    5. Update the software image by running:

      cm-chroot-sw-img /cm/images/<image-name>
      apt -y update
      apt -y upgrade
      apt -y autoremove
      exit
      
    6. Update the kernel if a newer version is available by running:

      cmsh
      softwareimage
      use <image-name>
      show | grep "Kernel version"
      kernelversions
      
      • Compare the versions; if a newer version is available and not set for the software image, set it by running:
      set kernelversion <kernel-version>
      commit
      
      • Note: wait for Initial ramdisk for image <image-name> was generated successfully message to appear, then run quit.
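    • Note: steps 2-4 above can also be run non-interactively with cmsh -c (a sketch with placeholder names; the initial ramdisk is still regenerated after the image clone, so wait for it to finish before rebooting nodes):

      cmsh -c "softwareimage; clone <from-image> <image-name>; commit"
      cmsh -c "category; clone default <category-name>; set softwareimage <image-name>; commit"
      cmsh -c "device; set <node-name> category <category-name>; commit"
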
  8. Install a workload manager by running cm-wlm-setup, and note the following for Slurm:

    • TODO: convert the following into a one liner.
    • Choose Setup (Step By Step).
    • Choose Slurm.
    • Optional: keep cluster name slurm.
    • Keep the head node only for server role.
    • Optional: keep overlay configuration.
    • Optional: unselect everything for client role.
    • Optional: unselect everything for client role.
    • Optional: keep overlay configuration.
    • Optional: keep prejob healthchecks empty.
    • Select yes for GPU resources settings.
    • Optional: keep settings for configuration overlay.
    • Select all categories that include a GPU for GPU role.
      • TODO: update the image.
    • Keep the head node unselected for GPU role.
    • Optional: keep settings for configuration overlay.
    • Optional: keep slots amount empty.
    • Keep selected for submit role.
    • Keep the head node for submit role.
    • Optional: keep default settings for overlay.
    • Optional: keep accounting configuration.
    • Optional: keep the head node for storage host.
    • Optional: select no for Slurm power saving.
    • Select Automatic NVIDIA GPU configuration.
    • Modify the Count column with the number of GPUs per compute node. No need to enter any other details.
      • Note: in case different compute nodes have different numbers of GPUs:
        • Set the number of GPUs for one version of a compute node (e.g., a compute node with 2 GPUs).
        • After the installation is complete, duplicate the configuration and modify it for any other version (explained in step 9 below).
    • Select yes for configuring Pyxis plugin.
    • Optional: keep Cgroups constraints empty.
    • Optional: keep default queue name.
    • Select Save config & deploy.
    • Optional: save the configuration file in the default location.
    • Complete the setup.
    • Note: if a Temporary failure resolving 'archive.ubuntu.com' error appears, undo the installation by pressing u, then try the following solutions for each software image and reinstall:
      • Relink resolv.conf:

        cm-chroot-sw-img /cm/images/<image-name>
        ln -sf ../run/systemd/resolve/resolv.conf /etc/resolv.conf
        exit
        
      • Manually install Enroot:

        cm-chroot-sw-img /cm/images/<image-name>
        wget https://github.com/NVIDIA/enroot/releases/download/v3.4.0/enroot_3.4.0-1_amd64.deb
        apt -y install ./enroot_3.4.0-1_amd64.deb
        apt -y remove enroot\*
        rm ./enroot_3.4.0-1_amd64.deb
        exit
        
        • Note: use Enroot's latest release.
      • Note: if Slurm still exists, remove it before reinstalling by running the cm-wlm-setup script and choosing Disable.

      • Note: if previous configuration overlays still exist, remove them before reinstalling:

        cmsh
        configurationoverlay
        list
        
        • Remove all listed by running remove <configuration-overlay> for each one, then exit with quit.
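    • Note: once the setup completes successfully, a quick way to verify that Slurm, Pyxis and Enroot work together is to run a small container job (a minimal smoke test; the ubuntu:20.04 image is only an example pulled from Docker Hub):

      module load shared
      module load slurm
      srun --container-image=ubuntu:20.04 grep PRETTY_NAME /etc/os-release
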
  9. In case different compute nodes have different numbers of GPUs, clone and set the configuration overlays:

    • First, set the categories of the original configuration overlay so it no longer includes the categories that differ:
    cmsh
    configurationoverlay
    use <configuration-overlay>
    set categories <category-name>
    commit
    
    • Then, clone the configuration overlay and set the different categories:
    clone <from-configuration-overlay> <configuration-overlay>
    set categories <category-name>
    roles
    use slurmclient
    genericresources
    set gpu0 count <number-of-gpus>
    commit
    quit
    
  10. Load Slurm by default on the head node by running module initadd slurm.

  11. Optional: the tmpfs /run volume is used as a cache for running containers and is configured automatically based on the compute nodes' hard disks. To view its size, run df -h on a compute node. To override the configuration, use:

    cmsh
    category use <category-name>
    fsmounts
    clone /dev/shm /run
    set mountoptions "defaults,size=<new size>"
    commit
    quit
    
  12. There's an issue with the nvidia-uvm kernel module and vGPUs that requires some initialization. The module is loaded, but the /dev/nvidia-uvm device node is missing. This can be observed on a compute node by running env | grep _CUDA_COMPAT_STATUS. To overcome this issue, do the following (provided by Adel Aly):

    • Enter to the software image by running: cm-chroot-sw-img /cm/images/<image-name>.

    • Create a new file /lib/systemd/system/nvidia-uvm-init.service with the following content:

      # nvidia-uvm-init.service
      # loads nvidia-uvm module and creates /dev/nvidia-uvm device nodes
      [Unit]
      Description=Initialize nvidia-uvm device on vGPU passthrough
      [Service]
      ExecStart=/usr/local/bin/nvidia-uvm-init.sh
      [Install]
      WantedBy=multi-user.target
      
    • Create a new file /usr/local/bin/nvidia-uvm-init.sh with the following content:

      #!/bin/bash
      ## Script to initialize NVIDIA device nodes.
      ## https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-verifications
      /usr/sbin/modprobe nvidia
      if [ "$?" -eq 0 ]; then
          # Count the number of NVIDIA controllers found.
          NVDEVS=`lspci | grep -i NVIDIA`
          N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
          NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`
          N=`expr $N3D + $NVGA - 1`
          for i in `seq 0 $N`; do
              mknod -m 666 /dev/nvidia$i c 195 $i
          done
          mknod -m 666 /dev/nvidiactl c 195 255
      else
          exit 1
      fi
      /sbin/modprobe nvidia-uvm
      if [ "$?" -eq 0 ]; then
          # Find out the major device number used by the nvidia-uvm driver.
          D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
          mknod -m 666 /dev/nvidia-uvm c $D 0
          mknod -m 666 /dev/nvidia-uvm-tools c $D 0
      else
          exit 1
      fi
      
    • Change the permissions of the script file by running: chmod 777 /usr/local/bin/nvidia-uvm-init.sh.

    • Enable the service and exit:

      systemctl enable nvidia-uvm-init.service
      exit
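
    • Optional: after a compute node is rebooted with the updated image, the fix can be verified on the node itself (a quick check; the service should be enabled and the device node present):

      systemctl status nvidia-uvm-init.service
      ls -l /dev/nvidia-uvm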
      
  13. Optional: add users.

    • Note: Slurm is not loaded by default for the users. To enable Slurm by default for all users created afterwards, edit the /etc/skel/.bashrc file and add:

      # load Slurm
      module load shared
      module load slurm
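
    • A user can be added from cmsh, for example (a minimal sketch; <username> is a placeholder and the password is set interactively when prompted):

      cmsh
      user
      add <username>
      set password
      commit
      quit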
      
  14. Reboot the head node.

Compute nodes installation

  1. Create a new VM with the following settings:

    • Name (optional): node01.
    • Guest OS: Linux - Ubuntu (64-bit).
    • Virtual Hardware:
      • CPU: >= 8 CPUs.
      • Memory: >= 16 GB.
      • Hard disk: 64 GB * number of users.
      • Network adapter:
        • An internal network.
      • Add one PCI device per GPU.
    • VM Options:
      • Boot Options: Firmware - EFI.
  2. Duplicate the VM for any other compute node and change the name accordingly.

  3. Launch the first compute node VM and connect to it (recommended through the remote console).

  4. The node should be PXE booted by the head node.

    • Note: if an error appears, try rebooting the node.
    • Choose the relevant node and provision it with the FULL option.
  5. SSH to the node from the head node for easier access.

  6. Update the node with apt -y update, apt -y upgrade and apt -y autoremove.

  7. For vGPU:

    1. Uninstall the existing CUDA driver with sudo apt -y remove --purge cuda-driver.

    2. Install the vGPU driver:

      • TODO: add images.
      • Download the vGPU driver from NVIDIA Application Hub -> NVIDIA Licensing Portal -> Software Downloads.
      • Copy the vGPU driver file ending with .run to the compute node.
      • Run chmod +x <file path>.
      • Run the installation file.
      • Keep DKMS disabled for the kernel.
      • Accept the warning.
      • Select No for installing NVIDIA's 32-bit compatibility libraries.
      • Accept the warning.
      • Run nvidia-smi and make sure the GPUs are visible.
      • Remove the installation file.
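      • Optional: the installer can also be run non-interactively (a sketch; the .run file name depends on the downloaded vGPU driver version, --silent runs the installer without prompts, and DKMS stays disabled unless --dkms is passed):

        chmod +x ./NVIDIA-Linux-x86_64-<version>-grid.run
        ./NVIDIA-Linux-x86_64-<version>-grid.run --silent --no-install-compat32-libs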
    3. Install the NVIDIA Container Toolkit (installation guide):

      • Setup the package repository and the GPG key by running:
      distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
              sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
              sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
      
      • Install the nvidia-docker2 package (and dependencies) after updating the package listing:
      apt -y update
      apt install -y nvidia-docker2
      
      • Restart the Docker daemon to complete the installation after setting the default runtime by running: systemctl restart docker.
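      • Optional: verify the setup by running a CUDA container with GPU access (a common smoke test; the image tag is only an example and can be replaced with any available nvidia/cuda tag):

        docker run --rm --gpus all nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi
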
  8. Make sure the /home directory is mounted by running cat /etc/fstab | grep "master:/home".

  9. Login to the head node.

  10. Grab the node's image into the relevant software image by running:

    cmsh
    device
    grabimage -i <image-name> -w <node-name>
    
    • Note: the grabbing process might take a few minutes. Wait until a grabimage [ COMPLETED ] message appears, then run quit to exit.
  11. For vGPU, install the vGPU license.

  12. Reboot the head node.

  13. Reboot all compute nodes one by one, link each new one to its name and allow the default provisioning.

    • Note: make sure the vGPU license is installed by running nvidia-smi -q | grep -A2 "vGPU Software Licensed Product" within a compute node.

Note: if a DGX fails to show GPUs when running nvidia-smi, it might be because the driver wasn't built against a newer kernel. To solve it, reinstall the driver by running apt reinstall nvidia-driver-<version> and grab the image via cmsh.

Running example

This example uses Horovod and TensorFlow.

  1. Login to the head node.

  2. Pull the latest NGC TensorFlow container by running: enroot import 'docker://nvcr.io#nvidia/tensorflow:<TensorFlow version>'.

    • Note: This will create a local .sqsh file.
  3. Git clone Horovod's GitHub repository by running: git clone https://github.com/horovod/horovod.

  4. Submit a Slurm job by running:

    srun --mpi=pmix \
    -G <number of GPUs> \
    --container-image=<path to TensorFlow sqsh file> \
    --container-mounts=<path to Horovod GitHub directory>:/code \
    python /code/examples/tensorflow/tensorflow_synthetic_benchmark.py
    
    • Note: if an Invalid MPI plugin name error is received when running a Slurm job with --mpi=pmix, it is probably due to a missing package. To solve it:
      1. SSH to the relevant node.
      2. Run /cm/shared/apps/cm-pmix3/3.1.4/bin/pmix_info to view the issue.
      3. Install the relevant package by running apt install libevent-pthreads-2.1-7.
      4. Grab the image.
  5. Examine the results and observe the GPU usage.
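
    • For example, GPU utilization can be watched on a compute node while the job runs (assuming SSH access from the head node; <node-name> is a placeholder):

      ssh <node-name> nvidia-smi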
