The speed of pytorch with cudatoolkit 11.0 is slower than cudatoolkit 10.2
klyjm opened this issue · comments
🐛 Bug
When I updated PyTorch to 1.7, cudatoolkit was automatically updated to 11.0, and the same code became much slower than before. After I changed cudatoolkit back to 10.2, the speed returned to normal. Maybe I should update the cuDNN version in Ubuntu?
To Reproduce
I ran the same code on the same device in the same environment, changing only the cudatoolkit version, and it became much slower.
cudatoolkit 10.2
Speed: 13.4/1.3/14.6 ms inference/NMS/total per 640x640 image at batch-size 1
cudatoolkit 11.0
Speed: 27.0/1.2/28.2 ms inference/NMS/total per 640x640 image at batch-size 1
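For reference, the relative slowdown implied by the two reported lines can be worked out directly. This is only a small sketch using the numbers quoted above; the variable names are illustrative and not part of the original report.

```python
# Timings in ms per 640x640 image at batch size 1, copied from the report above.
cuda_10_2 = {"inference": 13.4, "nms": 1.3, "total": 14.6}
cuda_11_0 = {"inference": 27.0, "nms": 1.2, "total": 28.2}

# A ratio above 1 means cudatoolkit 11.0 is slower for that stage.
slowdown = {stage: cuda_11_0[stage] / cuda_10_2[stage] for stage in cuda_10_2}

for stage, ratio in slowdown.items():
    print(f"{stage}: {ratio:.2f}x")
```

Inference time roughly doubles while NMS is essentially unchanged, which suggests the regression sits in the model forward pass rather than in post-processing.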
Expected behavior
The speed with 11.0 should be no slower than with 10.2.
Environment
Collecting environment information...
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V
GPU 2: TITAN V
GPU 3: TITAN V
GPU 4: TITAN V
GPU 5: TITAN V
Nvidia driver version: 450.80.02
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.7.0
[pip3] torchvision==0.8.1
[conda] blas 1.0 mkl https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] cudatoolkit 11.0.221 h6bb024c_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] libblas 3.8.0 20_mkl https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] libcblas 3.8.0 20_mkl https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] liblapack 3.8.0 20_mkl https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] liblapacke 3.8.0 20_mkl https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] mkl 2020.2 256 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl-service 2.3.0 py38he904b0f_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_fft 1.2.0 py38h23d657b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_random 1.1.1 py38h0573a6f_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy 1.19.1 py38hbc911f0_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy-base 1.19.1 py38hfa32c7d_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] pytorch 1.7.0 py3.8_cuda11.0.221_cudnn8.0.3_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchvision 0.8.1 py38_cu110 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
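The conda build string of the pytorch package encodes the CUDA and cuDNN versions the binary was built against. The `CUDA runtime version: 9.1.85` and `libcudnn.so.7.6.5` lines above appear to describe separate system-wide installs, so the build string is likely the more relevant one for this comparison. A minimal sketch for pulling the versions out of such a string follows; `parse_build` is a hypothetical helper written for illustration, not part of PyTorch or conda.

```python
import re

def parse_build(build):
    """Extract (cuda, cudnn) version strings from a conda build string, or None."""
    m = re.search(r"cuda([\d.]+)_cudnn([\d.]+)_", build)
    return (m.group(1), m.group(2)) if m else None

# Build string taken from the pytorch line of the environment dump above.
print(parse_build("py3.8_cuda11.0.221_cudnn8.0.3_0"))  # ('11.0.221', '8.0.3')
```

Inside a running process, `torch.version.cuda` and `torch.backends.cudnn.version()` report the versions PyTorch is actually using.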
Additional context
I can confirm this is the case as well. Recently had a 2x speedup downgrading from CUDA 11 to CUDA 10.2 on a GTX 1080 Ti.
Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Homebrew GCC 5.5.0_7) 5.5.0
Clang version: 11.0.0
CMake version: version 3.18.4
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.89 (11.1 installed by Pop!_OS)
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
Nvidia driver version: 455.38
cuDNN version: 7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-lightning==1.0.5
[pip3] torch==1.6.0
[pip3] torchvision==0.7.0
[conda] blas 2.20 mkl conda-forge
[conda] cudatoolkit 10.2.89 hfd86e86_1 anaconda
[conda] libblas 3.8.0 20_mkl conda-forge
[conda] libcblas 3.8.0 20_mkl conda-forge
[conda] liblapack 3.8.0 20_mkl conda-forge
[conda] liblapacke 3.8.0 20_mkl conda-forge
[conda] mkl 2020.2 256 conda-forge
[conda] numpy 1.19.4 py38hf0fd68c_0 conda-forge
[conda] pytorch 1.6.0 py3.8_cuda10.2.89_cudnn7.6.5_0 pytorch
[conda] pytorch-lightning 1.0.5 pyhd8ed1ab_0 conda-forge
[conda] torchvision 0.7.0 py38_cu102 pytorch
This is the output from within the conda environment I experienced this speedup in.
@ionlights @klyjm do you know if this is still the case with pytorch 1.7.1 ?
@LukeAI It is not stable: sometimes the speeds are the same, sometimes 11.0 is slower. So I still use 10.2.
@ionlights @klyjm do you know if this is still the case with pytorch 1.7.1 ?
It is for me (about 20% slower on CUDA 11.0 compared to CUDA 10.1).
Here are my first logs (CUDA=10.1):
Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
Clang version: Could not collect
CMake version: version 2.8.12.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti
Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchvision==0.8.2
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.1.243 h036e899_6 conda-forge
[conda] mkl 2020.4 h726a3e6_304 conda-forge
[conda] mkl-service 2.3.0 py38h1e0a361_2 conda-forge
[conda] mkl_fft 1.2.0 py38hab2c0dc_1 conda-forge
[conda] mkl_random 1.2.0 py38hc5bc63f_1 conda-forge
[conda] numpy 1.19.2 py38h54aff64_0
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] pytorch 1.7.1 py3.8_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] torchaudio 0.7.2 py38 pytorch
[conda] torchvision 0.8.2 py38_cu101 pytorch
Here are my second logs (CUDA=11.0):
Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
Clang version: Could not collect
CMake version: version 2.8.12.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti
Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchvision==0.8.2
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.0.3 h15472ef_6 conda-forge
[conda] mkl 2020.4 h726a3e6_304 conda-forge
[conda] mkl-service 2.3.0 py38h1e0a361_2 conda-forge
[conda] mkl_fft 1.2.0 py38hab2c0dc_1 conda-forge
[conda] mkl_random 1.2.0 py38hc5bc63f_1 conda-forge
[conda] numpy 1.19.2 py38h54aff64_0
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] pytorch 1.7.1 py3.8_cuda11.0.221_cudnn8.0.5_0 pytorch
[conda] torchaudio 0.7.2 py38 pytorch
[conda] torchvision 0.8.2 py38_cu110 pytorch
@feinsteinben Yeah, I have just tested my code: 1.7.1 with CUDA 11.0 is about 25% slower than with 10.2. That is faster than 1.7.0, but still slower.
Hi, I'm also facing the same issue (tried on A100 GPUs which I think need cuda >= 11). Was anybody able to overcome this issue? Thanks
Hi, I'm also facing the same issue on 1080 when debugging a Libtorch segmentation model. The pytorch branch is v1.7.1.
CUDA10.1-cudnn7 | CUDA11.1-cudnn8 |
---|---|
187ms | 320ms |
The speed really is still slower when using CUDA 11, and I don't know what causes it.
Have you tried building PyTorch yourself with CUDA 11.1 (CUDA 11.2 is released but has no cuDNN support yet), or using a nightly PyTorch build?
My colleague repeated my comparison on an RTX 2080 Ti, but observed no difference. I guess this is related to the device architecture.
I am also getting a 2x slowdown with cuda 11 vs 10.2 on pytorch 1.7.1 on a GTX1080Ti.
Same problem on ubuntu 18.04 using titan rtx. Almost 2x speed up on batch size 5 when using conda and downgrading from
pytorch-1.7.1-py3.7_cuda11.0.221_cudnn8.0.5_0
to
pytorch-1.7.0-py3.7_cuda10.2.89_cudnn7.6.5_0.
Driver version is 460. Should I perhaps downgrade to 455? The cuDNN support matrix confuses me a bit.
edit: tested the 1.8 nightly, which came with cuda 11.0 and cudnn 8.0.3, and did not encounter speed issues.
Is it happening also with Cuda 11.2 (supported by cudnn 8.1.0 since January 26th)?
CUDA 11.2 is not supported by 1.8.0 yet, so I haven't tried it.
I just tried 1.8 with 11.1; it is still about 10%~15% slower.
1.8 with 11.1 is about 40%~45% slower than 1.8 with 10.2.
If some of the benchmarks mentioned above are public, can someone post a concrete example?
I've run the simple built-in https://github.com/pytorch/pytorch/blob/master/benchmarks/fastrnns/bench.py and the results look pretty similar across the 10.2 and 11.1 toolkits on a single RTX 2080 GPU.
With CUDA-11.1:
Namespace(cnns=None, cuda_pointwise_block_count=None, cuda_pointwise_block_size=None, cuda_pointwise_loop_level=None, device='cuda', executor=None, fuser='te', group=['cnns', 'rnns'], hiddenSize=512, inputSize=512, miniBatch=64, nloops=100, numLayers=1, print_json=None, rnns=None, sep=' ', seqLength=100, variable_lstms=False, warmup=10)
Benchmarking LSTMs...
name avg_fwd std_fwd info_fwd avg_bwd std_bwd info_bwd
cudnn 7.056 0.1674 None 10.51 0.05994 None
aten 9.4 0.1954 None 22.74 2.145 None
jit 7.753 0.04015 None 22.83 2.266 None
jit_premul 6.892 0.02701 None 21.78 1.189 None
jit_premul_bias 7.118 0.03285 None 21.68 0.9731 None
jit_simple 7.661 0.07742 None 22.97 1.975 None
jit_multilayer 7.771 0.0539 None 23.4 2.204 None
py 15.72 0.2226 None 27.62 3.523 None
Benchmarking ResNets...
name avg_fwd std_fwd info_fwd avg_bwd std_bwd info_bwd
resnet18 15.88 0.02621 None 33.65 0.06821 None
resnet18_jit 15.93 0.03301 None 33.75 0.092 None
resnet50 52.15 0.09712 None 110.7 0.31 None
resnet50_jit 52.24 0.09196 None 111.2 0.2458 None
With CUDA-10.2:
Namespace(cnns=None, cuda_pointwise_block_count=None, cuda_pointwise_block_size=None, cuda_pointwise_loop_level=None, device='cuda', executor=None, fuser='te', group=['cnns', 'rnns'], hiddenSize=512, inputSize=512, miniBatch=64, nloops=100, numLayers=1, print_json=None, rnns=None, sep=' ', seqLength=100, variable_lstms=False, warmup=10)
Benchmarking LSTMs...
name avg_fwd std_fwd info_fwd avg_bwd std_bwd info_bwd
cudnn 6.972 0.08801 None 10.11 0.05425 None
aten 8.679 0.3157 None 22.1 2.322 None
jit 7.735 0.1454 None 23.48 3.46 None
jit_premul 6.851 0.01975 None 21.04 1.09 None
jit_premul_bias 7.082 0.0167 None 21.21 1.128 None
jit_simple 7.708 0.03711 None 20.87 1.761 None
jit_multilayer 7.813 0.04202 None 21.66 2.1 None
py 13.14 0.4173 None 25.95 2.197 None
Benchmarking ResNets...
name avg_fwd std_fwd info_fwd avg_bwd std_bwd info_bwd
resnet18 16.96 0.03068 None 34.88 0.08062 None
resnet18_jit 16.99 0.02914 None 34.96 0.07073 None
resnet50 54.41 0.05765 None 113.8 0.1582 None
resnet50_jit 54.63 0.1907 None 114.6 0.7369 None
@malfet Based on the reports (self included), it seems like NVIDIA GPUs which lack tensor cores are affected (or maybe it's just the 10-series).
I should have some time today to run those benchmarks, though. Did you use the default arguments?
@jmuchovej yes, just run it as python -m fastrnns.bench
@jmuchovej Not yet. I have tried on a TITAN V, which has tensor cores, and it is still slow. Some people using a 2080 Ti also face the same problem. The 10 series is slower than the 20 series and the TITAN V, but those are still slow too. I think this may depend on the task. I have only been using yolov5 recently, so that is all I have tried. From a little investigation, it seems more time is spent in convolutions; maybe the conv implementation is different?
I don't keep my previous builds, so I don't have comparable benchmark results, but the situation for me with an RTX 2060 was like this:
I saw a huge performance boost, especially for mixed training, from pytorch 1.6.0-cuda 10.2 to pytorch 1.7.1-cuda 11.0. Normal training was the same. From pytorch 1.7.1-cuda 11.0 to pytorch 1.8.0-cuda 11.1, I've lost around 15-20% for both mixed and normal training.
These results are very similar on both Windows and Ubuntu.
@malfet I can't reproduce the slowdown with your benchmark. I am not sure why it doesn't show up.
But on my own repo I still see a 40% slowdown with pytorch 1.8 and cudatoolkit 11.1. It's mostly just a resnet with a double backward pass.
One epoch is ~1:20 with pytorch 1.8 and cudatoolkit 10.2, it's ~1:50 with cudatoolkit 11.1
This was tested on a GTX1080Ti. Everything installed through conda as described on pytorch.org.
@y0ast thank you for the link to repro, will look into this one
Same problem on a 10-series GPU. PyTorch 1.8.1 with py3.9_cuda11.1_cudnn8_0 is around 30-40% slower than PyTorch 1.6.0 with py3.7_cuda102_cudnn7_0.
Same problem on 940M GPU
Unfortunately this does not seem to be fixed:
PyTorch version + cudatoolkit version
1.8.1 + 10.2: 1m45s per epoch
1.8.1 + 11.1: 2m20s
1.9 + 11.1: 2m20s
1.9 + 10.2: 1m45s
All obtained on a GTX 1080Ti, using this repo https://github.com/y0ast/deterministic-uncertainty-quantification (the numbers differ a bit from my last comment because I counted an epoch differently, but the relative difference remains).
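The size of the gap in the list above can be made explicit by converting the epoch times to seconds. This is a small sketch based only on the figures listed; `to_seconds` is a hypothetical helper introduced for illustration.

```python
import re

def to_seconds(text):
    """Convert a duration such as '1m45s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+)s", text)
    if m is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(m.group(1)) * 60 + int(m.group(2))

# Epoch times reported for both PyTorch 1.8.1 and 1.9 on a GTX 1080Ti.
t_102 = to_seconds("1m45s")  # cudatoolkit 10.2
t_111 = to_seconds("2m20s")  # cudatoolkit 11.1

extra = (t_111 - t_102) / t_102
print(f"{t_102}s vs {t_111}s: {extra:.0%} longer per epoch")
```

The gap is identical for 1.8.1 and 1.9, so in these measurements it tracks the cudatoolkit version rather than the PyTorch version.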
PyTorch 1.9 for cudatoolkit 10.2 is 705MB, while it's 1.44GB for cudatoolkit 11.1. That seems like an unusually big difference.
@y0ast what is the command line to run this benchmark?
@ngimel On the master branch, just:
python train_duq_cifar.py --final_model
It'll download cifar10 on the fly and prints a tqdm progress bar with epoch timing.
I created the environments for pytorch=1.9 like this:
conda install pytorch torchvision cudatoolkit=11.1 -c pytorch -c nvidia
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch -c nvidia
@klyjm I met a similar problem on our new server with 8 x A100 GPUs. The difference is that my model runs slower only in DDP mode and runs normally in DP mode. After debugging, I found that the slow operation is loss.backward()
The environment I use is:
`Collecting environment information...
PyTorch version: 1.7.0+cu110
Is debug build: True
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: A100-PCIE-40GB
GPU 1: A100-PCIE-40GB
GPU 2: A100-PCIE-40GB
GPU 3: A100-PCIE-40GB
GPU 4: A100-PCIE-40GB
GPU 5: A100-PCIE-40GB
GPU 6: A100-PCIE-40GB
GPU 7: A100-PCIE-40GB
Nvidia driver version: 450.119.04
cuDNN version: Probably one of the following:
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.7.0+cu110
[pip3] torchaudio==0.7.0
[pip3] torchfile==0.1.0
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.8.1+cu110
[conda] numpy 1.19.5 pypi_0 pypi
[conda] torch 1.7.0+cu110 pypi_0 pypi
[conda] torchaudio 0.7.0 pypi_0 pypi
[conda] torchfile 0.1.0 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.8.1+cu110 pypi_0 pypi
`
It is about 3 times slower than a Tesla V100 GPU (CUDA 10.2, PyTorch 1.6) when I run the same code.
A100 training takes 6.01 s,
V100 training takes 1.89 s.
According to NVIDIA's official documentation, the A100 and V100 deliver 19 and 15 TFLOPS respectively, which means the A100 should run faster than the V100. I am really confused by the result.
Since CUDA 10.x does not support the A100 with its Ampere architecture, I didn't test the A100 with CUDA 10.x.
@graycrown Yeah, and I have tested PyTorch 1.9; there is no difference. This bug is still unsolved.
It seems to me this is related to cudnn.
I'm using pytorch 1.8.1 py3.8_cuda11.1_cudnn8.0.8_0 managed by Conda. I tried to benchmark training performance on GTX 1080 ti, RTX 2080, and RTX 3090.
When I set torch.backends.cudnn.enabled=False, the training performance went up by around 30%. The result is pretty consistent across all three of these GPUs.
@maxwxzheng can you try updating to PyTorch-1.9 and compare the performance? (Several linking issues that could have negatively affected the cuDNN performance were fixed in 1.9)
@malfet I have tested my code in PyTorch 1.9, there is no difference. The PyTorch with cuda 11.1 + cudnn 8.x is still slower than cuda 10.2 + cudnn 7.x
I tested with pytorch 1.9 on RTX 3090. No difference as well. Training with cudnn on is still about 30 to 40 % slower than cudnn off.
@klyjm, @maxwxzheng what benchmark are you running?
I cannot reproduce the perf difference on an RTX 2080 by running python train_duq_cifar.py --final_model; it reports 2m4s per iteration regardless of whether the CUDA-10.2 or CUDA-11.1 PyTorch binary is used.
@malfet As shown at the beginning, I use test.py in ultralytics/yolov5 to test the speed. After upgrading to CUDA 11 and cuDNN 8, the speed is always slower than with CUDA 10 and cuDNN 7.
@klyjm ok, I can observe perf degradation between CUDA-11 and CUDA-10.2 for batchsize 1, but if batch size is larger, trend is inverse:
CUDA ver | BatchSize | Inference time |
---|---|---|
10.2 | 1 | 4.5ms |
11.1 | 1 | 9.2ms |
10.2 | 64 | 4.3ms |
11.1 | 64 | 3.9ms |
@malfet Yeah, you are right. The bigger the batch size, the smaller the difference. I also find that setting torch.backends.cudnn.enabled=False really accelerates the code with CUDA 11.1, but slows it down with CUDA 10.2.
I did some benchmark across different pytorch versions.
My network architecture is efficientnetb0 + fpn. I used this library: https://github.com/qubvel/segmentation_models.pytorch
It seems training speed with cudnn on has been decreasing for this architecture. (Mem is peak memory usage. Time is total time spent on training.)
cc @ptrblck, can you reproduce these results?
I couldn't find any examples of how to use the posted repository, but since it seems to reuse the timm repo, I just used efficientnet_b0 from it directly and couldn't reproduce the issue:
source build, cudnn8.2.2
cudnn.enabled=False, cudnn.benchmark=False, 20.43099s
cudnn.enabled=True, cudnn.benchmark=False, 17.86255s
cudnn.enabled=True, cudnn.benchmark=True, 17.97606s
1.9.0+cu111, cudnn8.0.5
cudnn.enabled=False, cudnn.benchmark=False, 21.61472s
cudnn.enabled=True, cudnn.benchmark=False, 19.17168s
cudnn.enabled=True, cudnn.benchmark=True, 19.05530s
1.8.1+cu111, cudnn8.0.5
cudnn.enabled=False, cudnn.benchmark=False, 21.60732s
cudnn.enabled=True, cudnn.benchmark=False, 19.55659s
cudnn.enabled=True, cudnn.benchmark=True, 19.03868s
1.7.1+cu110, cudnn8.0.5
cudnn.enabled=False, cudnn.benchmark=False, 45.85837s
cudnn.enabled=True, cudnn.benchmark=False, 19.83326s
cudnn.enabled=True, cudnn.benchmark=True, 19.36150s
Code:
import torch
import torch.nn as nn
import time
import timm
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
model = timm.create_model('efficientnet_b0').cuda()
x = torch.randn(12, 3, 960, 640, device='cuda')
# warmup
for _ in range(10):
    out = model(x)
    out.backward(torch.ones_like(out))
grad = torch.ones_like(out)
nb_iters = 100
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    out = model(x)
    out.backward(grad)
torch.cuda.synchronize()
t1 = time.perf_counter()
print('cudnn.enabled={}, cudnn.benchmark={}, {:.5f}s'.format(
    torch.backends.cudnn.enabled, torch.backends.cudnn.benchmark, (t1 - t0)))
Thanks @ngimel and @ptrblck for looking into this.
I think maybe one of two things is happening:
- I'm doing something stupid in my code.
  I tried to profile the training with torch.profiler. I got the following result:
  It looks like aten::cudnn_convolution_backward_input is taking a long time. Is this common? What would cause this step to take such a long time?
- Somehow, when I install PyTorch 1.8+, conda automatically downgrades torchvision to 0.2.2. Would this older version of torchvision cause the slowdown?
@ptrblck hello, the same issue occurs on an Ampere RTX 3090. Tensor loading is 2 times slower with CUDA 11.3 + libtorch_cu113 (~6 sec) than with CUDA 10.2 + libtorch cu102 (~3 sec).
The problem is still there. Is there a solution?
@AlexTitovWork could you describe a bit more what "speed in tensor loading" means?
Are you measuring the time to transfer the tensor from CPU to the GPU and are seeing a slowdown?
Hello @ptrblck! I use a simple test that uploads data into a GPU tensor inside a Docker container.
// `input` is assumed to be an 8-bit, 3-channel image (e.g. a cv::Mat); it is not defined in the original post.
int height = 400; // color image size in px
int width = 400;
torch::Tensor tensor_image = torch::zeros({1, height, width, 3});
tensor_image = tensor_image.pin_memory();
// Reassign instead of redeclaring: a second `torch::Tensor tensor_image` would not compile.
tensor_image = torch::from_blob(input.data, {1, input.rows, input.cols, 3}, torch::kByte);
tensor_image = tensor_image.permute({0, 3, 1, 2});
tensor_image = tensor_image.toType(torch::kFloat);
tensor_image = tensor_image.div(255);
tensor_image = tensor_image.to(torch::kCUDA);
I ran the same test on two GPU platforms.
Allocating + data transfer:
- on RTX 2080 Ti under:
  CUDA Driver version / Runtime version 11.2 / 10.2
  CUDA Capability Major/Minor version number 7.5
  Takes ~3 sec.
- on RTX 3090 under:
  CUDA Driver version / Runtime version 11.4 / 11.3
  CUDA Capability Major/Minor version number 8.6
  Takes ~6 sec.
This is not much code, yet it takes ~3 seconds to allocate memory in the first case and a terrible ~6 seconds in the second.
Moreover, most of the time is taken by device initialization and memory allocation.
It seems to be a problem with the Docker setup I'm using.
Yes, loading the model parameters and allocating memory is very slow with cuda 11.5 and torch 1.3.1. I am testing on cuda 10.1 and hope it will work.
Hello! I found the following information about loading and memory allocation at the first start of LibTorch or PyTorch.
From here https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#troubleshooting
Q: Why is cuDNN version 8.0 convolution API call much slower on the first call than subsequent calls?
A: Due to the library split, cuDNN version 8.0 API will only load the necessary kernels on the first API call that requires it. In previous versions, this load would have been observed in the first cuDNN API call that triggers CUDA context initialization, typically cudnnCreate(). In version 8.0, this is delayed until the first sub-library call that triggers CUDA context initialization. Users who desire to have CUDA context preloaded can call the new cudnnCnnInferVersionCheck() API (or its related cousins), which has the side effect of initializing a CUDA context. This will reduce the run time for all subsequent API calls.
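Given this FAQ entry, it may help to time the first call separately from later ones, so that one-time loading is not mistaken for a steady-state slowdown. The sketch below is a generic helper, not a cuDNN or PyTorch API; `first_vs_steady`, `fn`, and `sync` are hypothetical names. On a GPU one would presumably pass `torch.cuda.synchronize` as `sync` so that queued kernels finish before the clock is read.

```python
import time

def first_vs_steady(fn, iters=20, sync=lambda: None):
    """Return (first_call_seconds, mean_seconds_of_later_calls) for fn()."""
    sync()
    t0 = time.perf_counter()
    fn()  # includes any one-time initialisation or lazy kernel loading
    sync()
    first = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    steady = (time.perf_counter() - t0) / iters
    return first, steady
```

If only the first number differs between two CUDA/cuDNN builds, the cost is start-up overhead of the kind described above; if the second number differs as well, the slowdown affects every iteration.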
Same observation for torch 1.8.2 on a 2080ti machine. The overall speed of one of my training jobs is about 20% slower with cu11 than cu10.2.
I also still observe this with pytorch 1.11.0
With https://github.com/y0ast/DUE
python train_due.py
~2 minutes on cudatoolkit 10.2, ~4 minutes per epoch on cudatoolkit 11.3
Reproduced on two different machines with a 1080Ti (driver 510.47) and Titan Xp (driver 510.68).
I've gone over @ptrblck's example to see where the difference comes from and just adding:
torch.backends.cudnn.benchmark = True
Makes my epoch go from 4:55 to 1:57 with newer CUDA/CuDNN versions on the codebase I linked above. This is the same as it was with CUDA 10.2 (and CuDNN 7+).
My hypothesis is that in CuDNN 8+ the default convolution algorithm changed. This change is probably fine for newer hardware, but runs badly on older hardware. By setting benchmark to true, CuDNN is forced to re-evaluate that choice and finds that the old choice is better.
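One way to check this hypothesis on a given machine is to time the same training step under each cuDNN setting. The following is a sketch under stated assumptions rather than a definitive benchmark: `sweep_cudnn_flags`, `step`, `flags_ctx`, and `sync` are hypothetical names. With PyTorch one would pass `torch.backends.cudnn.flags` (a context manager that accepts `enabled` and `benchmark`) as `flags_ctx` and `torch.cuda.synchronize` as `sync`.

```python
import time

# The three (enabled, benchmark) combinations reported earlier in the thread.
SETTINGS = [(False, False), (True, False), (True, True)]

def sweep_cudnn_flags(step, flags_ctx, sync=lambda: None, warmup=5, iters=20):
    """Time step() under each setting; returns seconds per call keyed by setting."""
    results = {}
    for enabled, benchmark in SETTINGS:
        with flags_ctx(enabled=enabled, benchmark=benchmark):
            for _ in range(warmup):  # gives benchmark mode a chance to pick algorithms
                step()
            sync()
            t0 = time.perf_counter()
            for _ in range(iters):
                step()
            sync()
            results[(enabled, benchmark)] = (time.perf_counter() - t0) / iters
    return results
```

Passing the context manager and the synchronisation function in as arguments keeps the helper free of a hard PyTorch dependency. The warm-up loop matters because benchmark mode spends its first iterations trying candidate algorithms.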
Same here, @y0ast didn't change anything for me
Same here. pytorch==1.10.1+cu111 is 4 times faster than pytorch==1.10.1+cu113 on an A100 machine with cudatoolkit=11.4.
"Same here" is unfortunately not actionable.
@namespace-Pt this issue also discussed the difference between CUDA 10.2 and 11.0, while you are using 11.x releases, so please feel free to create a new issue providing a minimal, executable code snippet as well as your system information as asked in the bug template.