CUPTI_ERROR_INSUFFICIENT_PRIVILEGES in Docker

johnbensnyder opened this issue 4 years ago · comments

Ben Snyder commented 4 years ago

GPU profiling inside a Docker container fails with CUPTI_ERROR_INSUFFICIENT_PRIVILEGES unless the container is started with the docker run option '--privileged=true'.

The topic is discussed in this issue:

tensorflow/tensorflow#35860

Can Docker setup instructions be included on the profiler setup page?

https://www.tensorflow.org/guide/profiler

ckluk · Answer 1 · Thu Jun 11 2020 06:11:08 GMT+0800 (China Standard Time)

Thanks for the suggestion. We will add the Docker setup instructions to the profiler guide as suggested.

-ck

Dom Miketa · Answer 2 · Sat Jun 13 2020 05:17:44 GMT+0800 (China Standard Time)

Good to hear, @ckluk, thank you! If you're open to taking requests, I'd be very interested in a Docker setup in which:

a) GPU profiling works
b) the container runs as a normal user (so that all newly created files, e.g. logs and saved models, are owned by the user, not root)

but I can't get both to work at the same time. I have the following in (the host machine's) /etc/modprobe.d/nvidia-kernel-common.conf:
options nvidia "NVreg_RestrictProfilingToAdminUsers=0"
and I ran
update-initramfs -u
after adding it (and rebooted afterwards).

The Docker container is created by
docker run -it --gpus=all --rm --user "$(id -u):$(id -g)" dom/tensorflow:2.2.0-gpu
(plus some volume binds etc). Unfortunately, this setup leads to CUPTI_ERROR_INSUFFICIENT_PRIVILEGES.
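
With a setup like this, it may be worth confirming that the driver option actually took effect after the reboot. Below is a minimal sketch, assuming the loaded NVIDIA driver lists its parameters in /proc/driver/nvidia/params and reports the setting as an "RmProfilingAdminOnly" line (1 meaning restricted to admins, 0 meaning open to all users). The function name profiling_restriction is made up for illustration.

```shell
#!/bin/sh
# Report whether the NVIDIA driver restricts GPU performance counters to admins.
# Assumption: the params file contains a line such as "RmProfilingAdminOnly: 1".
profiling_restriction() {
    params_file="${1:-/proc/driver/nvidia/params}"
    # No readable params file (e.g. driver not loaded): cannot tell.
    if [ ! -r "$params_file" ]; then
        echo "unknown"
        return 0
    fi
    # Pull out the value that follows "RmProfilingAdminOnly:".
    value=$(awk -F': *' '$1 == "RmProfilingAdminOnly" { print $2 }' "$params_file")
    case "$value" in
        0) echo "unrestricted" ;;
        1) echo "restricted" ;;
        *) echo "unknown" ;;
    esac
}

# Check the running driver on the host.
profiling_restriction
```

If this prints "restricted" on the host, the modprobe option was probably not picked up. If it prints "unrestricted" and the error persists, the cause is more likely on the container side.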

ckluk · Answer 3 · Sat Jun 13 2020 05:25:01 GMT+0800 (China Standard Time)

Hi Dom, I don't think we can do much from the Profiler end, as the privilege requirement comes from CUPTI. In the future (probably around the TF 2.4 release), TF will use CUDA 11. My understanding is that this CUPTI privilege requirement should go away with CUDA 11. Thanks, -ck

Dom Miketa · Answer 4 · Sat Jun 13 2020 06:07:31 GMT+0800 (China Standard Time)

Thanks @ckluk. I was hoping it's a matter of bad setup, but it's good to hear it'll at least eventually get resolved.

Adam · Answer 5 · Thu Jun 18 2020 06:07:17 GMT+0800 (China Standard Time)

@d-miketa Instead of running the container with --privileged=true, try --cap-add=CAP_SYS_ADMIN

More info: https://developer.nvidia.com/nvidia-development-tools-solutions-err-nvgpuctrperm-cupti
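
As a rough illustration of the suggestion above, the sketch below only prints a docker run command that adds the single capability instead of full --privileged mode, so it can be inspected before running. It uses the spelling SYS_ADMIN (without the CAP_ prefix), which is how Docker's documentation lists the capability. The function name profiling_run_cmd is hypothetical, and the image name is the one from earlier in this thread. Nobody in this thread has confirmed that this is enough for CUPTI, or that it combines with --user.

```shell
#!/bin/sh
# Print (do not execute) a docker run command that grants only SYS_ADMIN.
# All options go before the image name; anything after the image is passed
# to the container as the command to run.
profiling_run_cmd() {
    image="$1"
    echo "docker run -it --rm --gpus=all --cap-add=SYS_ADMIN $image /bin/bash"
}

# Image name taken from the earlier comment; replace with your own.
profiling_run_cmd "dom/tensorflow:2.2.0-gpu"
```

Copy the printed line into a terminal to actually start the container.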

Dom Miketa · Answer 6 · Thu Jul 09 2020 00:12:20 GMT+0800 (China Standard Time)

I ended up doing the following, some subset of which seems to have done the trick:

updating host machine to Ubuntu 20.04
adding options nvidia "NVreg_RestrictProfilingToAdminUsers=0" to /etc/modprobe.d/nvidia-kernel-common.conf and running update-initramfs -u
adding export CUDA_VERSION="10.1", export LD_LIBRARY_PATH="/usr/local/cuda-${CUDA_VERSION}/lib64:/usr/local/cuda-${CUDA_VERSION}/extras/CUPTI/lib64" and export LD_INCLUDE_PATH="/usr/local/cuda-${CUDA_VERSION}/include:/usr/local/cuda-${CUDA_VERSION}/extras/CUPTI/include" to the host machine's .zshrc
adding ENV LD_INCLUDE_PATH="/usr/local/cuda/include:/usr/local/cuda/extras/CUPTI/include:$LD_INCLUDE_PATH" to the Dockerfile
running the Docker container with --privileged

It's possible that --cap-add=CAP_SYS_ADMIN would work as well as --privileged, but I haven't tried.

Dhiren Hamal · Answer 7 · Tue Jan 19 2021 22:14:44 GMT+0800 (China Standard Time)

Hi! How do I pass those parameters to the Docker container?
I ran the following but got an error:
nvidia-docker run -d -it --name retina_net -v /home/readib/Experiments/:/ -p 8000:8888 -v /tmp/.X11-unix/:/tmp/.X11-unix -e DISPLAY=$DISPLAY retina_net:latest --cap-add=CAP_SYS_ADMIN /bin/bash

Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "--cap-add=CAP_SYS_ADMIN": executable file not found in $PATH: unknown

Thank you.

Dhiren Hamal · Answer 8 · Tue Jan 19 2021 22:51:10 GMT+0800 (China Standard Time)

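This works. The option has to go before the image name, otherwise Docker treats it as the command to run inside the container: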
nvidia-docker run '--privileged=true' -d -it --name retina_net -v /home/readib/Experiments/:/home -p 8000:8888 -v /tmp/.X11-unix/:/tmp/.X11-unix -e DISPLAY=$DISPLAY retina_net:latest /bin/bash