iot-salzburg / gpu-jupyter

GPU-Jupyter: Leverage the flexibility of Jupyterlab through the power of your NVIDIA GPU to run your code from Tensorflow and Pytorch in collaborative notebooks on the GPU.

CUDA version incompatibility

omarabb315 opened this issue · comments

GPU-Jupyter Issue Report

Issue Description

When I pull and run any of these images:

v1.6_cuda-12.0_ubuntu-22.04_python-only, v1.6_cuda-11.8_ubuntu-22.04_python-only, v1.5_cuda-12.0_ubuntu-22.04_python-only, v1.5_cuda-11.8_ubuntu-22.04_python-only

and then run nvcc --version, the output reports CUDA version 12.3, even though the images are tagged with different CUDA versions.

To Reproduce

sudo docker run -it --rm --gpus all cschranz/gpu-jupyter:v1.5_cuda-11.8_ubuntu-22.04_python-only nvcc --version

Expected Behavior

Output showing CUDA version 11.8.
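The expectation can be turned into a small check. The following is a minimal sketch, not part of GPU-Jupyter: the helper names are hypothetical, and it assumes the image tag contains `cuda-<major>.<minor>` and that nvcc prints a `release <major>.<minor>` line, as in the output quoted in this issue.

```python
import re
from typing import Optional


def cuda_from_tag(tag: str) -> Optional[str]:
    """Extract 'major.minor' from a tag such as v1.5_cuda-11.8_ubuntu-22.04_python-only."""
    match = re.search(r"cuda-(\d+\.\d+)", tag)
    return match.group(1) if match else None


def cuda_from_nvcc(output: str) -> Optional[str]:
    """Extract 'major.minor' from the 'release X.Y' line of `nvcc --version`."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None


def versions_match(tag: str, nvcc_output: str) -> bool:
    """Return True only if both versions were found and are identical."""
    expected = cuda_from_tag(tag)
    actual = cuda_from_nvcc(nvcc_output)
    return expected is not None and expected == actual


if __name__ == "__main__":
    # Sample text copied from the nvcc output reported in this issue.
    sample = "Cuda compilation tools, release 12.3, V12.3.107"
    tag = "v1.5_cuda-11.8_ubuntu-22.04_python-only"
    print(cuda_from_tag(tag), cuda_from_nvcc(sample), versions_match(tag, sample))
```

In practice the nvcc text would come from running the docker command above, for example captured with `subprocess.run(..., capture_output=True, text=True)`.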

Screenshots

image

Environment

Operating System:
Ubuntu 22.04

NVIDIA GPU and CUDA version Details:
image

GPU-Jupyter Version:

Any of these: v1.6_cuda-12.0_ubuntu-22.04_python-only, v1.6_cuda-11.8_ubuntu-22.04_python-only, v1.5_cuda-12.0_ubuntu-22.04_python-only, v1.5_cuda-11.8_ubuntu-22.04_python-only

Docker command and parameters:

sudo docker run -it --rm --gpus all cschranz/gpu-jupyter:v1.5_cuda-11.8_ubuntu-22.04_python-only nvcc --version

Browser (if applicable):

Firefox

Hi @omarabb315
Unfortunately, this resulted from an unintended CUDA update on the build machine.

As a temporary workaround, you can build the image locally (until the images are rebuilt):

./generate-Dockerfile.sh --python-only
docker build -t gpu-jupyter .build/  # will take a while
docker run --gpus all -d -it -p 8848:8888 -v $(pwd)/data:/home/jovyan/work -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes -e NB_UID="$(id -u)" -e NB_GID="$(id -g)" --user root --restart always --name gpu-jupyter_1 gpu-jupyter

Thank you for your reply. I built the image locally after cloning the repo and used your commands, but I still get the same output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

Did I miss something?

This is caused by the following lines in the Dockerfile:

# reinstall nvcc with cuda-nvcc to install ptxas
USER $NB_UID
RUN mamba install -c nvidia cuda-nvcc -y && \
    mamba clean --all -f -y && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER

which installs the latest cuda-nvcc from https://anaconda.org/nvidia/cuda-nvcc, currently version 12.3.107.

(This has nothing to do with the CUDA version on the build machine.)

So how can I solve it, given that I used the commands @ChristophSchranz mentioned?

@omarabb315 You can ask mamba to install a specific version: amend the mamba install command to use cuda-nvcc=12.2.140 instead (as an example).

@yankcrime Thank you for your help. I am wondering why you recommended pinning cuda-nvcc to 12.2 in the pull request when the base image has a different CUDA version (12.0.1): nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04.

What about using cuda-nvcc=12.0.140?

Thanks for the PR #130 @yankcrime !
I've adapted it to pin the CUDA version to 12.0 (as the GPU libraries don't officially support higher versions yet), using cuda-nvcc=12.0.140 as @omarabb315 suggested.
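To keep the pin aligned with the base image, the spec can be derived from the base image tag rather than typed by hand. This is only an illustrative sketch: the function name is hypothetical, and it assumes the tag starts with `<major>.<minor>` and that a wildcard spec such as `cuda-nvcc=12.0.*` is acceptable to mamba in place of an exact build like 12.0.140.

```python
import re


def nvcc_spec_for_base_image(base_image: str) -> str:
    """Build a conda-style spec pinning cuda-nvcc to the base image's CUDA major.minor."""
    # Look for ':<major>.<minor>' at the start of the image tag,
    # e.g. ':12.0' in nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04.
    match = re.search(r":(\d+)\.(\d+)", base_image)
    if match is None:
        raise ValueError(f"no CUDA version found in image tag: {base_image!r}")
    major, minor = match.groups()
    return f"cuda-nvcc={major}.{minor}.*"


if __name__ == "__main__":
    base = "nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04"
    # The result would replace the unpinned 'cuda-nvcc' in the mamba install line.
    print(nvcc_spec_for_base_image(base))
```

Pinning to major.minor instead of a full version leaves room for patch releases while still preventing a jump to 12.3.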

Should be closed with #135

Please let me know in case this error still occurs!

@ChristophSchranz Thank you for your help. After pulling the new image, I still get these messages when importing TensorFlow:

Screenshot 2024-01-17 020036

I believe this is the reason my multi-GPU training crashes.

Hi @omarabb315,

I could reproduce your issue.
I think the cuDNN version preinstalled in nvidia/cuda is 8.6, which causes warnings in TensorFlow versions higher than 2.13 (see here). Unfortunately, with TF 2.13 (including the with-cuda installation), TF no longer finds CUDA.

Can you show me the output of the following commands?

python -c "import tensorflow; print(tensorflow.__version__); print(tensorflow.test.is_built_with_cuda())"
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"

And have you verified that this issue affects performance? TF emits a lot of warnings, many of which are not that relevant. See here.
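For a single report that puts the build-time CUDA and cuDNN versions next to the GPUs TensorFlow can actually see, something like the following sketch may help. The function name is hypothetical; it assumes a TensorFlow 2.x build where `tf.sysconfig.get_build_info()` is available, and it returns an empty report if TensorFlow is not installed.

```python
import importlib.util


def tensorflow_cuda_report() -> dict:
    """Collect TensorFlow's version, CUDA/cuDNN build info, and visible GPUs."""
    report = {
        "tensorflow_version": None,
        "built_with_cuda": None,
        "build_cuda": None,
        "build_cudnn": None,
        "visible_gpus": [],
    }
    # Skip everything if TensorFlow is not installed in this environment.
    if importlib.util.find_spec("tensorflow") is None:
        return report

    import tensorflow as tf

    build = tf.sysconfig.get_build_info()
    report["tensorflow_version"] = tf.__version__
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    # These keys are absent on CPU-only builds, hence .get().
    report["build_cuda"] = build.get("cuda_version")
    report["build_cudnn"] = build.get("cudnn_version")
    report["visible_gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return report


if __name__ == "__main__":
    for key, value in tensorflow_cuda_report().items():
        print(f"{key}: {value}")
```

If `build_cudnn` differs from the cuDNN version shipped in the base image, that would be consistent with the mismatch described above; an empty `visible_gpus` list would point to TF not finding CUDA at all.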

Thanks for the feedback, and sorry for the late response, @ChristophSchranz.
Yes, here is the output of the commands you requested:

Screenshot 2024-02-03 153855

And no, I haven't verified that it affects performance, but I administer a university server with JupyterHub, and I need to ensure everything works correctly and is compatible.