torch gpu problem

Question

arnirs opened this issue 5 months ago · comments

Issue Description

Describe the issue
Hi, and thanks for your hard work. Unfortunately, PyTorch cannot use the GPU in the v1.6_cuda-11.8_ubuntu-22.04 image: it reports that CUDA 12.1 was not found. PyTorch now supports CUDA 12, and the Dockerfile used to build the v1.6_cuda-11.8_ubuntu-22.04 image does not pin the CUDA version when installing PyTorch, so the CUDA 12 build of PyTorch gets installed. To install PyTorch with CUDA 11.8, the following command should be used:
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
or for the latest pytorch:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

To Reproduce
Steps to reproduce the behavior:
docker run --gpus all -d -it -p 8848:8888 -v $(pwd)/data:/home/jovyan/work -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes --user root cschranz/gpu-jupyter:v1.6_cuda-11.8_ubuntu-22.04

Log in to JupyterLab.

Run the following in a notebook:
import torch
print(torch.version.cuda)

Expected Behavior
Torch must use CUDA 11.8.
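
A slightly fuller notebook check can make the mismatch explicit. This is only a sketch: `cuda_matches` is a hypothetical helper name, not part of PyTorch, and the `torch` import is guarded so the cell also runs where PyTorch is missing.

```python
from typing import Optional


def cuda_matches(reported: Optional[str], expected: str = "11.8") -> bool:
    """Return True if the reported CUDA version has the expected major.minor."""
    if reported is None:  # CPU-only builds report None for torch.version.cuda
        return False
    return reported.split(".")[:2] == expected.split(".")[:2]


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch:", torch.__version__)
        print("built for CUDA:", torch.version.cuda)
        print("GPU usable:", torch.cuda.is_available())
        print("matches 11.8:", cuda_matches(torch.version.cuda))
```

In the image described here, the last line would be expected to print False, since the installed build targets CUDA 12.1 rather than 11.8.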

Environment

Operating System:
Ubuntu 22.04.3

NVIDIA GPU and CUDA version Details:
CUDA 11.8
NVIDIA driver 520.xxx

GPU-Jupyter Version:
v1.6_cuda-11.8_ubuntu-22.04

Thanks in advance.

Darren Reid · Answer 1 · Sun Feb 25 2024 06:59:55 GMT+0800 (China Standard Time)

Setting the environment variable TORCH_CUDA_ARCH_LIST got it to load correctly for me. For example, in a Compose file:

version: "3.8"
services:
  gpu-jupyter:
    container_name: gpu-jupyter
    build: .build
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
              - gpu
    # # Set hardware limits: one GPU, max. 48GB RAM, max. 31 CPUs
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           capabilities: [gpu]
    #           device_ids: ["0"]  # select one GPU
    #     limits:
    #       cpus: "31.0"
    #       memory: 48g
    ports:
      - 10000:8888
    volumes:
      - ./data:/home/jovyan/work
    environment:
      GRANT_SUDO: "yes"
      JUPYTER_ENABLE_LAB: "yes"
      NB_UID: ${JUPYTER_UID:-1000}
      NB_GID: ${JUPYTER_GID:-1000}
      JUPYTER_TOKEN: ${JUPYTER_TOKEN}
      TORCH_CUDA_ARCH_LIST: 8.6
    # enable sudo permissions
    user:
      "root"
    restart: always

Hope that helps.
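
The value 8.6 above is specific to the GPU in use. As a rough sketch, the matching value for another card can be derived from its compute capability. This assumes an environment where PyTorch can already see the GPU, and `arch_list_value` is a hypothetical helper name, not a PyTorch function.

```python
from typing import Tuple


def arch_list_value(capability: Tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as 'major.minor'."""
    major, minor = capability
    return f"{major}.{minor}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        # get_device_capability returns a (major, minor) tuple for the device
        capability = torch.cuda.get_device_capability(0)
        print("TORCH_CUDA_ARCH_LIST =", arch_list_value(capability))
    else:
        print("no usable GPU visible to torch; cannot query compute capability")
```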

Christoph · Answer 2 · Thu Feb 29 2024 23:00:40 GMT+0800 (China Standard Time)

Hi,
The problem originates from the torch installation routine that was recommended at the time: it updates the CUDA version and corrupts the installation. PyTorch now recommends an installation with a fixed CUDA version again, and I changed it in this commit: 2ac3181

I am surprised that it occurs for this image now, as it worked in the tests. Is the error fixed if you build the image from the repository? If so, I'll update the image tag.

@Layoric Thanks for the quick fix! Since the root cause is a corrupted CUDA installation, I'll fix it at the source.

Christoph · Answer 3 · Thu Mar 21 2024 18:06:43 GMT+0800 (China Standard Time)

Commit 9982802 should resolve this problem in version v1.6_cuda-11.8_ubuntu-22.04. For the pip install of PyTorch, the index-url is now pinned to CUDA 11.8.
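
The exact line from that commit is not quoted in this thread. As a sketch of what pinning the index URL generally looks like, the snippet below builds the pip command following the pattern used in PyTorch's install instructions. The helper names are hypothetical, and the command is only printed, not executed.

```python
import sys
from typing import List


def pytorch_index_url(cuda_version: str) -> str:
    """Map a CUDA version such as '11.8' to the PyTorch wheel index URL."""
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


def pip_install_command(cuda_version: str = "11.8") -> List[str]:
    """Build (but do not run) a pip command pinned to one CUDA wheel index."""
    return [
        sys.executable, "-m", "pip", "install",
        "torch", "torchvision", "torchaudio",
        "--index-url", pytorch_index_url(cuda_version),
    ]


print(" ".join(pip_install_command()))
```

With the index pinned this way, pip resolves torch from the CUDA 11.8 wheel index instead of the default one.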

Please re-open if the issue still occurs.