tensorflow / profiler

A profiling and performance analysis tool for TensorFlow

CUPTI_ERROR_INSUFFICIENT_PRIVILEGES in Docker

johnbensnyder opened this issue

GPU profiling in Docker requires including the docker run option '--privileged=true'.

Topic is discussed in this issue:

tensorflow/tensorflow#35860

Can Docker setup instructions be included on the profiler setup page?

https://www.tensorflow.org/guide/profiler

commented

Good to hear, @ckluk, thank you! If you're open to taking requests, I'd be very interested in a Docker setup in which:

a) GPU profiling works
b) the container is run as a normal user (so that all newly created files, e.g. logs and saved models, are owned by the user, not root)

but I can't get both to work at the same time. I have the following in (the host machine's) /etc/modprobe.d/nvidia-kernel-common.conf:
options nvidia "NVreg_RestrictProfilingToAdminUsers=0"
and I ran
update-initramfs -u
after adding it (and rebooted afterwards).

The Docker container is created by
docker run -it --gpus=all --rm --user "$(id -u):$(id -g)" dom/tensorflow:2.2.0-gpu
(plus some volume binds etc). Unfortunately, this setup leads to CUPTI_ERROR_INSUFFICIENT_PRIVILEGES.
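
One way to confirm that the NVreg_RestrictProfilingToAdminUsers=0 option actually took effect after the reboot is to read the driver's parameter file on the host. The sketch below assumes the NVIDIA driver exposes /proc/driver/nvidia/params with a line of the form "RmProfilingAdminOnly: 0"; the function name check_profiling_restriction is made up for illustration.

```shell
# check_profiling_restriction is a hypothetical helper name.
# It reads a driver parameter file (assumed default:
# /proc/driver/nvidia/params) and prints whether GPU profiling
# appears to be restricted to admin users.
check_profiling_restriction() {
    params_file="${1:-/proc/driver/nvidia/params}"
    if [ ! -r "$params_file" ]; then
        # No readable parameter file (driver not loaded, or a different layout).
        echo "unknown"
        return 0
    fi
    # Assumed line format: "RmProfilingAdminOnly: <0|1>"
    value=$(grep '^RmProfilingAdminOnly' "$params_file" | awk '{print $2}')
    case "$value" in
        0) echo "unrestricted" ;;
        1) echo "restricted" ;;
        *) echo "unknown" ;;
    esac
    return 0
}
```

Calling check_profiling_restriction with no argument on the host reads the default path. A result of "restricted" would suggest the modprobe option was not picked up, in which case the container flags alone are unlikely to be the problem.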

commented

Thanks @ckluk. I was hoping it's a matter of bad setup, but it's good to hear it'll at least eventually get resolved.

commented

@d-miketa Instead of running the container with --privileged=true, try --cap-add=CAP_SYS_ADMIN

More info: https://developer.nvidia.com/nvidia-development-tools-solutions-err-nvgpuctrperm-cupti

I ended up doing the following, some subset of which seems to have done the trick:

  • updating host machine to Ubuntu 20.04
  • adding options nvidia "NVreg_RestrictProfilingToAdminUsers=0" to /etc/modprobe.d/nvidia-kernel-common.conf and running update-initramfs -u
  • adding export CUDA_VERSION="10.1", export LD_LIBRARY_PATH="/usr/local/cuda-${CUDA_VERSION}/lib64:/usr/local/cuda-${CUDA_VERSION}/extras/CUPTI/lib64" and export LD_INCLUDE_PATH="/usr/local/cuda-${CUDA_VERSION}/include:/usr/local/cuda-${CUDA_VERSION}/extras/CUPTI/include" to the host machine's .zshrc
  • adding ENV LD_INCLUDE_PATH="/usr/local/cuda/include:/usr/local/cuda/extras/CUPTI/include:$LD_INCLUDE_PATH" to the Dockerfile
  • running the Docker container with --privileged

It's possible that --cap-add=CAP_SYS_ADMIN would work as well as --privileged, but I haven't tried.
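
Since it is unclear which of the steps above mattered, it can help to verify inside the container that a CUPTI library is reachable through the library search path at all. The following is a rough sketch, assuming the library file is named libcupti.so (possibly with a version suffix) as in the CUDA extras/CUPTI/lib64 directory; find_cupti_dir is a hypothetical helper name, not part of any tool.

```shell
# find_cupti_dir is a hypothetical helper name.
# It walks a colon-separated search path (default: $LD_LIBRARY_PATH) and
# prints the first directory containing a file matching libcupti.so*.
# Returns 0 if a directory was found, 1 otherwise.
find_cupti_dir() {
    search_path="${1:-${LD_LIBRARY_PATH:-}}"
    old_ifs=$IFS
    IFS=:
    # The path is split on ':' once, when the loop starts.
    for dir in $search_path; do
        IFS=$old_ifs
        for lib in "$dir"/libcupti.so*; do
            # An unmatched glob stays literal, so test for existence.
            if [ -e "$lib" ]; then
                echo "$dir"
                return 0
            fi
        done
    done
    IFS=$old_ifs
    return 1
}
```

Run without arguments, find_cupti_dir inspects the current LD_LIBRARY_PATH. An empty result only says the library is not on that path; it does not rule out the permission error discussed in this thread.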

Hi! How do I pass those parameters to the Docker container?
I tried the following but got an error:
nvidia-docker run -d -it --name retina_net -v /home/readib/Experiments/:/ -p 8000:8888 -v /tmp/.X11-unix/:/tmp/.X11-unix -e DISPLAY=$DISPLAY retina_net:latest --cap-add=CAP_SYS_ADMIN /bin/bash

Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "--cap-add=CAP_SYS_ADMIN": executable file not found in $PATH: unknown

Thank you.

To run the container, pass the option before the image name:
nvidia-docker run '--privileged=true' -d -it --name retina_net -v /home/readib/Experiments/:/home -p 8000:8888 -v /tmp/.X11-unix/:/tmp/.X11-unix -e DISPLAY=$DISPLAY retina_net:latest /bin/bash
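
The "executable file not found in $PATH" error in the earlier attempt comes from argument order: docker run expects its options before the image name and treats everything after the image as the command to execute inside the container. The dry-run sketch below only prints a command line to show that ordering; build_run_cmd is a hypothetical helper, and whether --cap-add=CAP_SYS_ADMIN is sufficient for profiling has not been confirmed in this thread.

```shell
# build_run_cmd is a hypothetical helper name; it prints a command and
# does not start any container.
# Layout expected by docker run:
#   docker run [OPTIONS] IMAGE [COMMAND] [ARG...]
# Anything placed after IMAGE is handed to the container as its command.
build_run_cmd() {
    image="$1"
    shift
    # All remaining arguments are treated as docker options here, so they
    # are emitted before the image name; /bin/bash is the container command.
    printf '%s\n' "nvidia-docker run $* $image /bin/bash"
}

build_run_cmd retina_net:latest --cap-add=CAP_SYS_ADMIN -d -it --name retina_net
# prints: nvidia-docker run --cap-add=CAP_SYS_ADMIN -d -it --name retina_net retina_net:latest /bin/bash
```

The same ordering applies to '--privileged=true' and to the -v, -p and -e options used above.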