gotch.CUDA.IsAvailable() returns false but CUDA is in fact available
XiongUp opened this issue · comments
1. nvcc -V shows CUDA version 10.2:
(base) root@7c8a9f63a67f:/dev/shm/gotorch_projects/test1# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
2. export CUDA_VER=10.2 && bash setup-libtorch.sh runs successfully.
3. The environment is set up. The following lines have been added to my .bashrc file (/usr/lib64-nvidia is not found on my system):
export GOTCH_LIBTORCH="/usr/local/lib/libtorch"
export LIBRARY_PATH="$LIBRARY_PATH:$GOTCH_LIBTORCH/lib"
export CPATH="$CPATH:$GOTCH_LIBTORCH/lib:$GOTCH_LIBTORCH/include:$GOTCH_LIBTORCH/include/torch/csrc/api/include"
LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$GOTCH_LIBTORCH/lib:/usr/local/cuda-10.2/lib64"
4. export CUDA_VER=10.2 && export GOTCH_VER=v0.7.0 && bash setup-gotch.sh runs successfully.
5. My example:
package main

import (
	"fmt"

	"github.com/sugarme/gotch"
)

func main() {
	if gotch.CUDA.IsAvailable() {
		fmt.Println("CUDA is available")
	} else {
		fmt.Println("No Find")
	}
}
The result shows "No Find" because gotch.CUDA.IsAvailable() returns false.
In addition, I can train neural networks on the GPU with gotorch (https://github.com/wangkuiyi/gotorch) or PyTorch. So CUDA is in fact available!
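As an aside, one quick way to confirm that the variables from step 3 actually reach the Go toolchain is to print them from Go itself (a standalone sketch, no gotch required; the variable names follow the .bashrc lines above):

```go
package main

import (
	"fmt"
	"os"
)

// missingVars returns the names for which the lookup yields an empty value.
func missingVars(get func(string) string, names []string) []string {
	var missing []string
	for _, n := range names {
		if get(n) == "" {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	names := []string{"GOTCH_LIBTORCH", "LIBRARY_PATH", "CPATH", "LD_LIBRARY_PATH"}
	for _, n := range names {
		fmt.Printf("%s=%q\n", n, os.Getenv(n))
	}
	if m := missingVars(os.Getenv, names); len(m) > 0 {
		fmt.Println("unset or empty (was .bashrc sourced, and is each line exported?):", m)
	}
}
```

Note that the LD_LIBRARY_PATH line in the .bashrc above has no leading export; if that variable prints empty here, that could be one explanation.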
I suspect your system was installed with libtorch for CPU because setup-libtorch.sh failed to pick up the cuda flag.
Can you please delete libtorch from your system at /usr/local/lib/libtorch (sudo rm -rf /usr/local/lib/libtorch), then either
- download libtorch for CUDA 10.2 and unzip it to /usr/local/lib/libtorch, or
- rerun the fixed setup-libtorch.sh that I will commit shortly.
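To tell a CPU-only libtorch apart from a CUDA build without reinstalling, you can also look at which shared objects the unzipped tree ships; the CUDA zips include libtorch_cuda* / libc10_cuda libraries while the CPU-only zips do not (a rough sketch; treat the exact filenames as an assumption):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isCudaLib reports whether a libtorch shared-object filename is CUDA-specific.
// Assumption: the official CUDA zips ship libtorch_cuda*.so and libc10_cuda.so.
func isCudaLib(name string) bool {
	return strings.HasPrefix(name, "libtorch_cuda") || strings.HasPrefix(name, "libc10_cuda")
}

func main() {
	entries, err := os.ReadDir("/usr/local/lib/libtorch/lib")
	if err != nil {
		panic(err)
	}
	found := false
	for _, e := range entries {
		if isCudaLib(e.Name()) {
			fmt.Println("found:", e.Name())
			found = true
		}
	}
	if !found {
		fmt.Println("no CUDA libraries found: this libtorch looks CPU-only")
	}
}
```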
Let me know how you go. Thanks.
Hello, thank you for your reply.
Firstly, I'm sure I installed libtorch for GPU, not for CPU, because I had already fixed setup-libtorch.sh and the script downloaded libtorch-cxx11-abi-shared-with-deps-1.11.0+cu102.zip yesterday.
Secondly, I tried your suggestions yesterday, but they didn't work. In addition, I tried to install gotch in other Docker containers, but I ended up in the same situation.
Thirdly, I installed gotch@v0.4.3 on my computer and it works well: gotch.CUDA.IsAvailable() returns true, but there seems to be a memory leak when I run the MNIST example. RAM and video memory grow rapidly after each epoch.
Thanks for your feedback. I am currently using CUDA 11 and it's fine. I believe the MNIST memory leak you experienced has been fixed recently (please see the update on the master branch); otherwise, please report what you find.
Would you mind sharing what OS you are using and the output of nvidia-smi (I know you use CUDA 10.2)? I may be able to set up a machine with CUDA 10.2 to see how it goes and let you know.
Hello~ You are right, the MNIST memory leak I experienced has been fixed recently.
- My OS is Ubuntu 18.04.3 LTS (Bionic Beaver)
(base) root@7c8a9f63a67f:/dev/shm/gotorch_projects/test1# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
- nvidia-smi shows
(base) root@7c8a9f63a67f:/dev/shm/gotorch_projects/test1# nvidia-smi
Wed Apr 13 17:50:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 511.04.01 Driver Version: 511.09 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| N/A 82C P2 55W / N/A | 1064MiB / 6144MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4977 C /test1 N/A |
+-----------------------------------------------------------------------------+
Finally, thanks for your reply and contribution again.
gotch@v0.6.2 also works well in my new container, but gotch@v0.7.0 still can't use the GPU.
@XiongUp ,
I found time to set up a machine with CUDA 10.2 installed on Debian 10 with an RTX 3060:
Sun Apr 17 13:09:03 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74 Driver Version: 470.74 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 35% 40C P0 1W / 170W | 0MiB / 12053MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
Here is a minimal example that can find CUDA:
package main

import (
	"fmt"

	"github.com/sugarme/gotch"
	"github.com/sugarme/gotch/ts"
)

func main() {
	device := gotch.CudaIfAvailable()
	x := ts.MustOfSlice([]float64{1, 2, 3, 4}).MustTo(device, true).MustView([]int64{2, 2}, true)
	fmt.Printf("%0.1f\n", x)
	fmt.Printf("%i\n", x)
}
Output:
1.0 2.0
3.0 4.0
TENSOR INFO:
Shape: [2 2]
DType: float64
Device: {CUDA 0}
Defined: true
Can you try to delete $GOPATH/pkg/mod/github.com/sugarme/gotch@v0.7.0 and run go get github.com/sugarme/gotch@v0.7.0 again?
Try to build with go build -x . and inspect whether CGO links libtorch as expected.
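Beyond reading the go build -x trace, the finished binary itself records which shared libraries it needs; on Linux the standard debug/elf package can dump those (a sketch; ./test1 is a placeholder for your built binary):

```go
package main

import (
	"debug/elf"
	"fmt"
	"strings"
)

// torchDeps filters a list of DT_NEEDED entries down to the
// libtorch/CUDA-related ones.
func torchDeps(libs []string) []string {
	var out []string
	for _, l := range libs {
		if strings.Contains(l, "torch") || strings.Contains(l, "cuda") {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	f, err := elf.Open("./test1") // placeholder: path to your built binary
	if err != nil {
		panic(err)
	}
	defer f.Close()
	libs, err := f.ImportedLibraries() // DT_NEEDED entries of the dynamic section
	if err != nil {
		panic(err)
	}
	deps := torchDeps(libs)
	if len(deps) == 0 {
		fmt.Println("no libtorch/CUDA libraries linked: CGO likely picked up the wrong libtorch")
	}
	for _, d := range deps {
		fmt.Println(d)
	}
}
```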
@XiongUp ,
I set up a Google Colab instance with Ubuntu 18.04.3 and CUDA 10.2, same as yours. gotch v0.7.0 is able to use CUDA. Please see the link below:
https://colab.research.google.com/drive/1ssMOAjOOI1lLt70jZJm_RBCewALfXP4M?usp=sharing
So I believe it is a linking issue on your local machine.
It still doesn't work if I install gotch v0.7.0 following README.md, but gotch v0.7.0 is able to use CUDA if I install it following https://colab.research.google.com/drive/1ssMOAjOOI1lLt70jZJm_RBCewALfXP4M?usp=sharing:
$ wget -q --show-progress --progress=bar:force:noscroll -O /tmp/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu102.zip https://download.pytorch.org/libtorch/cu102/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu102.zip
$ unzip -qq -o /tmp/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu102.zip -d /usr/local
$ unzip -qq -j -o /tmp/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu102.zip libtorch/lib/* -d /usr/lib64-nvidia/
$ unzip -qq -j -o /tmp/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcu102.zip libtorch/lib/* -d /usr/local/cuda/lib64/stubs/
The following lines are added to .bashrc:
export GOTCH_LIBTORCH="/usr/local/lib/libtorch"
export LIBRARY_PATH="$LIBRARY_PATH:/usr/local/libtorch/lib"
export CPATH="$CPATH:/usr/local/libtorch/lib:/usr/local/libtorch/include:/usr/local/libtorch/include/torch/csrc/api/include"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib64-nvidia:/usr/local/cuda-10.2/lib64:/usr/local/libtorch/lib"
export CC="clang"
export CXX="clang++"
@XiongUp ,
Thanks for your report and happy to see it works (in a hacky way though).
As you mentioned, the Google Colab installation works, so I believe the issue is related to your local setup and library linking. It is probably your CUDA configuration across multiple libraries (Python PyTorch, libtorch C++, ...), which I have no idea how to track down.
I will close this issue for now. Feel free to discuss if you have other thoughts. Thanks.