Python installer CUDA version mismatch in conda environment

Question

HowcanoeWang opened this issue 4 months ago

Problem statement

I ran into a CUDA version mismatch when trying to install this package on Arch Linux.

Since Arch Linux always keeps system packages at the latest version, downgrading the system CUDA is neither feasible nor convenient (it may conflict with other system software), so I prefer to use conda to manage different CUDA and cuDNN versions.

The system nvcc / CUDA version on Arch Linux is 12.5, located at /opt/cuda/bin/nvcc:

$ nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

Reproduce

Create a conda environment with CUDA 11.8 using the following commands:

$ conda create -n nerfstudio python=3.10
$ conda activate nerfstudio
(nerfstudio) $ python -m pip install --upgrade pip 
(nerfstudio) $ conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
(nerfstudio) $ conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
(nerfstudio) $ conda install ninja

Checking the nvcc version in this environment shows that CUDA 11.8 is correctly installed:

(nerfstudio) $ nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

(nerfstudio) $ python -c "import torch; print(torch.__version__)"

2.1.2

(nerfstudio) $ python -c "import torch; print(torch.version.cuda)"

11.8

However, when I use this environment to install the tiny-cuda-nn package with pip, it fails with the following mismatch error:

(nerfstudio) $ pip install 'git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch'

Collecting git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
  Cloning https://github.com/NVlabs/tiny-cuda-nn/ to /tmp/pip-req-build-unexg2nn
  Running command git clone --filter=blob:none --quiet https://github.com/NVlabs/tiny-cuda-nn/ /tmp/pip-req-build-unexg2nn
  Resolved https://github.com/NVlabs/tiny-cuda-nn/ to commit b3473c81396fe927293bdfd5a6be32df8769927c
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: tinycudann
  Building wheel for tinycudann (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [74 lines of output]
      /tmp/pip-req-build-unexg2nn/bindings/torch/setup.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
        from pkg_resources import parse_version
      Building PyTorch extension for tiny-cuda-nn version 1.7
      Obtained compute capability 86 from PyTorch
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2022 NVIDIA Corporation
      Built on Wed_Sep_21_10:33:58_PDT_2022
      Cuda compilation tools, release 11.8, V11.8.89
      Build cuda_11.8.r11.8/compiler.31833905_0
      Detected CUDA version 11.8
      Targeting C++ standard 17
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-310
      creating build/lib.linux-x86_64-cpython-310/tinycudann
      copying tinycudann/__init__.py -> build/lib.linux-x86_64-cpython-310/tinycudann
      copying tinycudann/modules.py -> build/lib.linux-x86_64-cpython-310/tinycudann
      running egg_info
      creating tinycudann.egg-info
      writing tinycudann.egg-info/PKG-INFO
      writing dependency_links to tinycudann.egg-info/dependency_links.txt
      writing top-level names to tinycudann.egg-info/top_level.txt
      writing manifest file 'tinycudann.egg-info/SOURCES.txt'
      reading manifest file 'tinycudann.egg-info/SOURCES.txt'
      writing manifest file 'tinycudann.egg-info/SOURCES.txt'
      copying tinycudann/bindings.cpp -> build/lib.linux-x86_64-cpython-310/tinycudann
      running build_ext
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-req-build-unexg2nn/bindings/torch/setup.py", line 189, in <module>
          setup(
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/__init__.py", line 104, in setup
          return distutils.core.setup(**attrs)
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 184, in setup
          return run_commands(dist)
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 200, in run_commands
          dist.run_commands()
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/dist.py", line 967, in run_command
          super().run_command(command)
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 368, in run
          self.run_command("build")
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
          self.distribution.run_command(command)
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/dist.py", line 967, in run_command
          super().run_command(command)
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 132, in run
          self.run_command(cmd_name)
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
          self.distribution.run_command(command)
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/dist.py", line 967, in run_command
          super().run_command(command)
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 91, in run
          _build_ext.run(self)
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 359, in run
          self.build_extensions()
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 525, in build_extensions
          _check_cuda_version(compiler_name, compiler_version)
        File "/home/crest/Applications/miniconda3/envs/nerfstudio/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 413, in _check_cuda_version
          raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
      RuntimeError:
      The detected CUDA version (12.5) mismatches the version that was used to compile
      PyTorch (11.8). Please make sure to use the same CUDA versions.
    
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tinycudann
  Running setup.py clean for tinycudann
Failed to build tinycudann
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (tinycudann)

The strange thing is that, according to the output log, it successfully detected CUDA version 11.8:

      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2022 NVIDIA Corporation
      Built on Wed_Sep_21_10:33:58_PDT_2022
      Cuda compilation tools, release 11.8, V11.8.89
      Build cuda_11.8.r11.8/compiler.31833905_0
      Detected CUDA version 11.8
      Targeting C++ standard 17

but for the actual compilation it seems to use the system CUDA version 12.5.

What I have tried

Change environment variables

Setting PATH and other environment variables on the command line like this does not work:

export PATH="/home/hwang/Applications/miniconda3/envs/nerfstudio/bin:$PATH"
export LD_LIBRARY_PATH="/home/hwang/Applications/miniconda3/envs/nerfstudio/lib:$LD_LIBRARY_PATH"

I even tried setting them in .bashrc and .zshrc. After that, nvcc reported the conda environment's version even with the conda environment deactivated, but the build still failed:

$ nvcc -V   # with conda environment deactivated

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Delete system CUDA package

I renamed the system CUDA directory:

sudo mv /opt/cuda /opt/cuda.back

The installer then failed because it could not find /opt/cuda/bin/nvcc, even though it had detected the conda nvcc:

/home/hwang/Documents/Github/tiny-cuda-nn/bindings/torch/setup.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
          from pkg_resources import parse_version
        nvcc: NVIDIA (R) Cuda compiler driver
        Copyright (c) 2005-2022 NVIDIA Corporation
        Built on Wed_Sep_21_10:33:58_PDT_2022
        Cuda compilation tools, release 11.8, V11.8.89
        Build cuda_11.8.r11.8/compiler.31833905_0
        Building PyTorch extension for tiny-cuda-nn version 1.7
        Obtained compute capability 89 from PyTorch
        Detected CUDA version 11.8
        Targeting C++ standard 17
        running develop
        /home/hwang/Applications/miniconda3/envs/sdfstudio/lib/python3.10/site-packages/setuptools/command/develop.py:40: EasyInstallDeprecationWarning: easy_install command is deprecated.

        ...

          self.initialize_options()
        running egg_info
        writing tinycudann.egg-info/PKG-INFO
        writing dependency_links to tinycudann.egg-info/dependency_links.txt
        writing top-level names to tinycudann.egg-info/top_level.txt
        reading manifest file 'tinycudann.egg-info/SOURCES.txt'
        writing manifest file 'tinycudann.egg-info/SOURCES.txt'
        running build_ext
        error: [Errno 2] No such file or directory: '/opt/cuda/bin/nvcc'

Expected solution

Any ideas or suggestions for making the installer use the conda environment's nvcc rather than the system one?

Thomas Müller · Answer 1 · Fri Jul 12 2024 13:15:21 GMT+0800 (China Standard Time)

Hi, thanks for the detailed report! I have to admit that I have no idea why the build system defaults to the system CUDA rather than the one that is foremost in your conda environment PATH.

For reference: the code for setting up the Python bindings is in setup.py, which, according to your logs, does seem to correctly detect CUDA 11.8.

Like you say, it's only the build itself that seems to use the system's CUDA -- and this is parameterized by Python's setuptools rather than ourselves (see the calls to CUDAExtension(...) and setup(...)). In their documentation (1, 2), I don't see a straightforward way to parameterize the build system itself.

As a workaround, you could try temporarily prepending your conda environment's CUDA bin path to your system's PATH variable and its lib path to your system's LD_LIBRARY_PATH, and then removing them again once the build has succeeded.
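One way to keep such a change temporary is to build a modified copy of the environment and hand it only to the build process, so nothing has to be undone afterwards. The following is a minimal sketch, not part of tiny-cuda-nn or PyTorch; the helper name build_env_with_conda_cuda is hypothetical, and it assumes the conda environment keeps nvcc under bin/ and its libraries under lib/.

```python
import os


def build_env_with_conda_cuda(conda_prefix, base_env=None):
    """Return a copy of the environment with the conda env's bin/ and lib/ first.

    Hypothetical helper: assumes nvcc lives in <conda_prefix>/bin and the CUDA
    libraries in <conda_prefix>/lib. The calling process is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    bin_dir = os.path.join(conda_prefix, "bin")
    lib_dir = os.path.join(conda_prefix, "lib")
    # Prepend, skipping empty entries so no stray separator ends up in the value.
    env["PATH"] = os.pathsep.join(p for p in (bin_dir, env.get("PATH", "")) if p)
    env["LD_LIBRARY_PATH"] = os.pathsep.join(
        p for p in (lib_dir, env.get("LD_LIBRARY_PATH", "")) if p
    )
    return env
```

The returned dictionary could then be passed as env= to subprocess.run(...) when invoking the pip install command, which limits the change to that single child process. Prepending PATH this way does not touch any other variable the build tooling may consult.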

浩瀚猫 · Answer 2 · Fri Jul 12 2024 13:41:34 GMT+0800 (China Standard Time)

@Tom94 Thanks a lot for your reply. Checking the source code of torch/utils/cpp_extension.py, I found that it reads the CUDA path from CUDA_HOME or, failing that, CUDA_PATH:

def _find_cuda_home() -> Optional[str]:
    r'''Finds the CUDA install path.'''
    # Guess #1
    cuda_home = os.environ.get('CUDA_HOME') or os.environ.get('CUDA_PATH')

I then checked those two environment variables and found that CUDA_PATH was misleading the compiler:

$ echo $CUDA_PATH 
/opt/cuda

Changing it to point to the conda environment fixed the issue (other compilation issues came up afterwards, but they are unrelated to this topic).
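To see which CUDA root would win on a given machine without importing torch, the lookup order can be approximated in a few lines. This is a rough sketch, not the PyTorch implementation: only the first step (CUDA_HOME, then CUDA_PATH) comes from the excerpt quoted above. The fallback to the nvcc found on PATH is an assumption about the later steps, and guess_cuda_home is a hypothetical name.

```python
import os
import shutil


def guess_cuda_home(environ=None):
    """Approximate which CUDA root the extension build would pick.

    Hypothetical helper. Step 1 mirrors the quoted excerpt; step 2 is an
    assumed fallback and may not match every PyTorch version.
    """
    env = os.environ if environ is None else environ
    # Step 1 (from the quoted source): CUDA_HOME wins, then CUDA_PATH.
    cuda_home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    if cuda_home:
        return cuda_home
    # Step 2 (assumption): use the nvcc on PATH and strip the trailing bin/nvcc.
    nvcc = shutil.which("nvcc", path=env.get("PATH", ""))
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))
    return None


if __name__ == "__main__":
    print("CUDA root that would be used:", guess_cuda_home())
```

With CUDA_PATH=/opt/cuda set system-wide, this returns /opt/cuda no matter which nvcc is first on PATH, which matches the behaviour seen in this thread. Pointing CUDA_HOME (or CUDA_PATH) at the conda environment before running pip, for example at the directory conda exposes as CONDA_PREFIX when the environment is activated, is one way to apply the fix. That assumes the conda cuda-toolkit package places nvcc under that prefix's bin directory.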