GPU isolation options

Question

andy108369 opened this issue 6 months ago · comments

We want to make sure a user cannot request more AMD GPUs than they have been allocated by setting certain environment variables (e.g. HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES).
I am not sure whether this is currently an issue. We cannot verify it because we do not have a machine with more than one AMD GPU at present.

To clarify the concern: on NVIDIA, a Pod can gain access to all NVIDIA GPUs on the host by setting the NVIDIA_VISIBLE_DEVICES=all environment variable. We were able to work around this by setting --set deviceListStrategy=volume-mounts for the nvdp/nvidia-device-plugin Helm chart, along with these settings in the /etc/nvidia-container-runtime/config.toml file:

accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false