gotch.CUDA.IsAvailable() returns false but CUDA is in fact available
XiongUp opened this issue · comments
1. nvcc -V shows CUDA version 10.2:
(base) root@7c8a9f63a67f:/dev/shm/gotorch_projects/test1# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
2. export CUDA_VER=10.2 && bash setup-libtorch.sh runs successfully.
3. The environment is set up. The following lines have been added to my .bashrc file (/usr/lib64-nvidia is not found on my system):
export GOTCH_LIBTORCH="/usr/local/lib/libtorch"
export LIBRARY_PATH="$LIBRARY_PATH:$GOTCH_LIBTORCH/lib"
export CPATH="$CPATH:$GOTCH_LIBTORCH/lib:$GOTCH_LIBTORCH/include:$GOTCH_LIBTORCH/include/torch/csrc/api/include"
LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$GOTCH_LIBTORCH/lib:/usr/local/cuda-10.2/lib64"
4. export CUDA_VER=10.2 && export GOTCH_VER=v0.7.0 && bash setup-gotch.sh runs successfully.
5. My example:
package main

import (
	"fmt"

	"github.com/sugarme/gotch"
)

func main() {
	if gotch.CUDA.IsAvailable() {
		fmt.Println("CUDA is available")
	} else {
		fmt.Println("No Find")
	}
}
The result shows "No Find" because gotch.CUDA.IsAvailable() returns false.
In addition, I can train neural networks on the GPU with gotorch (https://github.com/wangkuiyi/gotorch) or PyTorch. So CUDA is in fact available!
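As an aside, one quick way to confirm that the variables from step 3 actually reach the Go toolchain is to print them from Go itself (a standalone sketch, no gotch required; the variable names follow the .bashrc lines above):

```go
package main

import (
	"fmt"
	"os"
)

// missingVars returns the names for which the lookup yields an empty value.
func missingVars(get func(string) string, names []string) []string {
	var missing []string
	for _, n := range names {
		if get(n) == "" {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	names := []string{"GOTCH_LIBTORCH", "LIBRARY_PATH", "CPATH", "LD_LIBRARY_PATH"}
	for _, n := range names {
		fmt.Printf("%s=%q\n", n, os.Getenv(n))
	}
	if m := missingVars(os.Getenv, names); len(m) > 0 {
		fmt.Println("unset or empty (was .bashrc sourced, and is each line exported?):", m)
	}
}
```

Note that the LD_LIBRARY_PATH line in the .bashrc above has no leading export; if that variable prints empty here, that could be one explanation.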
I suspect your system was installed with libtorch for CPU because setup-libtorch.sh failed to pick up the cuda flag.
Can you please delete libtorch from your system at /usr/local/lib/libtorch (sudo rm -rf /usr/local/lib/libtorch), then either
- download libtorch for CUDA 10.2 and unzip it to /usr/local/lib/libtorch, or
- rerun the fixed setup-libtorch.sh that I will commit shortly.
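To tell a CPU-only libtorch apart from a CUDA build without reinstalling, you can also look at which shared objects the unzipped tree ships; the CUDA zips include libtorch_cuda* / libc10_cuda libraries while the CPU-only zips do not (a rough sketch; treat the exact filenames as an assumption):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isCudaLib reports whether a libtorch shared-object filename is CUDA-specific.
// Assumption: the official CUDA zips ship libtorch_cuda*.so and libc10_cuda.so.
func isCudaLib(name string) bool {
	return strings.HasPrefix(name, "libtorch_cuda") || strings.HasPrefix(name, "libc10_cuda")
}

func main() {
	entries, err := os.ReadDir("/usr/local/lib/libtorch/lib")
	if err != nil {
		panic(err)
	}
	found := false
	for _, e := range entries {
		if isCudaLib(e.Name()) {
			fmt.Println("found:", e.Name())
			found = true
		}
	}
	if !found {
		fmt.Println("no CUDA libraries found: this libtorch looks CPU-only")
	}
}
```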
Let me know how you go. Thanks.
Hello, thank you for your reply.
Firstly, I'm sure I installed libtorch for GPU, not for CPU, because I had already fixed setup-libtorch.sh and the script downloaded libtorch-cxx11-abi-shared-with-deps-1.11.0+cu102.zip yesterday.
Secondly, I tried your suggestions yesterday, but they didn't work. In addition, I tried to install gotch in other Docker containers, but I ended up in the same situation.
Thirdly, I installed gotch@v0.4.3 on my computer and it works well: gotch.CUDA.IsAvailable() returns true, but there seems to be a memory leak when I run the MNIST example. RAM and video memory grow rapidly after each epoch.
Thanks for your feedback. I am currently using CUDA 11 and it's fine. I believe the MNIST memory leak you experienced has been fixed recently (please see the update on the master branch); otherwise, please report what you find.
Would you mind sharing what OS you are using and the output of nvidia-smi (I know you use CUDA 10.2)? I may be able to set up a machine with CUDA 10.2 to see how it goes and let you know.
Hello~ You are right, the MNIST memory leak I experienced has been fixed recently.
- My OS is Ubuntu 18.04.3 LTS (Bionic Beaver)
(base) root@7c8a9f63a67f:/dev/shm/gotorch_projects/test1# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
- nvidia-smi shows
(base) root@7c8a9f63a67f:/dev/shm/gotorch_projects/test1# nvidia-smi
Wed Apr 13 17:50:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 511.04.01 Driver Version: 511.09 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| N/A 82C P2 55W / N/A | 1064MiB / 6144MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4977 C /test1 N/A |
+-----------------------------------------------------------------------------+
Finally, thanks for your reply and contribution again.
gotch@v0.6.2 also works well in my new container, but gotch@v0.7.0 still can't use the GPU.
@XiongUp ,
I found time to set up a machine with CUDA 10.2 installed on Debian 10 with an RTX 3060:
Sun Apr 17 13:09:03 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74 Driver Version: 470.74 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 35% 40C P0 1W / 170W | 0MiB / 12053MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
Here is a minimal example that can find CUDA:
package main

import (
	"fmt"

	"github.com/sugarme/gotch"
	"github.com/sugarme/gotch/ts"
)

func main() {
	device := gotch.CudaIfAvailable()
	x := ts.MustOfSlice([]float64{1, 2, 3, 4}).MustTo(device, true).MustView([]int64{2, 2}, true)
	fmt.Printf("%0.1f\n", x)
	fmt.Printf("%i\n", x)
}
Output:
1.0 2.0
3.0 4.0
TENSOR INFO:
Shape: [2 2]
DType: float64
Device: {CUDA 0}
Defined: true
Can you try to delete $GOPATH/pkg/mod/github.com/sugarme/gotch@v0.7.0 and run go get github.com/sugarme/gotch@v0.7.0 again?
Try to build with go build -x . and inspect whether CGO links libtorch as expected.
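Beyond reading the go build -x trace, the finished binary itself records which shared libraries it needs; on Linux the standard debug/elf package can dump those (a sketch; ./test1 is a placeholder for your built binary):

```go
package main

import (
	"debug/elf"
	"fmt"
	"strings"
)

// torchDeps filters a list of DT_NEEDED entries down to the
// libtorch/CUDA-related ones.
func torchDeps(libs []string) []string {
	var out []string
	for _, l := range libs {
		if strings.Contains(l, "torch") || strings.Contains(l, "cuda") {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	f, err := elf.Open("./test1") // placeholder: path to your built binary
	if err != nil {
		panic(err)
	}
	defer f.Close()
	libs, err := f.ImportedLibraries() // DT_NEEDED entries of the dynamic section
	if err != nil {
		panic(err)
	}
	deps := torchDeps(libs)
	if len(deps) == 0 {
		fmt.Println("no libtorch/CUDA libraries linked: CGO likely picked up the wrong libtorch")
	}
	for _, d := range deps {
		fmt.Println(d)
	}
}
```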
@XiongUp ,
I set up a Google Colab instance with Ubuntu 18.04.3 and CUDA 10.2, same as yours. gotch v0.7.0 is able to use CUDA. Please see the link below:
https://colab.research.google.com/drive/1ssMOAjOOI1lLt70jZJm_RBCewALfXP4M?usp=sharing
So I believe it is a linking issue on your local machine.
It still doesn't work if I install gotch v0.7.0 following README.md, but gotch v0.7.0 is able to use CUDA if I install it following https://colab.research.google.com/drive/1ssMOAjOOI1lLt70jZJm_RBCewALfXP4M?usp=sharing:
$ wget -q --show-progress --progress=bar:force:noscroll -O /tmp/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu102.zip https://download.pytorch.org/libtorch/cu102/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu102.zip
$ unzip -qq -o /tmp/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu102.zip -d /usr/local
$ unzip -qq -j -o /tmp/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu102.zip libtorch/lib/* -d /usr/lib64-nvidia/
$ unzip -qq -j -o /tmp/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu102.zip libtorch/lib/* -d /usr/local/cuda/lib64/stubs/
The following lines are added to .bashrc:
export GOTCH_LIBTORCH="/usr/local/lib/libtorch"
export LIBRARY_PATH="$LIBRARY_PATH:/usr/local/libtorch/lib"
export CPATH="$CPATH:/usr/local/libtorch/lib:/usr/local/libtorch/include:/usr/local/libtorch/include/torch/csrc/api/include"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib64-nvidia:/usr/local/cuda-10.2/lib64:/usr/local/libtorch/lib"
export CC="clang"
export CXX="clang++"
@XiongUp ,
Thanks for your report and happy to see it works (in a hacky way though).
As you mentioned, the Google Colab installation works, so I believe the issue is related to your local setup and library linking. It is probably your CUDA configuration across multiple libraries (Python PyTorch, libtorch C++, ...), which I have no idea how to track down.
I will close this issue for now. Feel free to discuss if you have other thoughts. Thanks.