pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org


The speed of PyTorch with cudatoolkit 11.0 is slower than with cudatoolkit 10.2

klyjm opened this issue · comments

commented

🐛 Bug

When I updated PyTorch to 1.7, cudatoolkit was automatically updated to 11.0, and the same code became much slower than before. When I changed cudatoolkit back to 10.2, the speed returned to normal. Should I update the cuDNN version on Ubuntu?

To Reproduce

I ran the same code on the same device in the same environment, changing only the cudatoolkit version, and it is much slower.

cudatoolkit 10.2
Speed:  13.4/1.3/14.6 ms inference/NMS/total per 640x640 image at batch-size 1
cudatoolkit 11.0
Speed: 27.0/1.2/28.2 ms inference/NMS/total per 640x640 image at batch-size 1
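
(For reference, a minimal sketch of how an inference-only timing like the one above can be reproduced outside of yolov5's test.py; the torchvision ResNet-50 stand-in, input size, and iteration count are illustrative assumptions, not part of the original report.)

import time
import torch
import torchvision

model = torchvision.models.resnet50().cuda().eval()   # stand-in for the yolov5 detector
x = torch.randn(1, 3, 640, 640, device='cuda')

with torch.no_grad():
    for _ in range(10):        # warm-up: exclude CUDA context init and cuDNN algorithm selection
        model(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()   # GPU work is asynchronous; sync before reading the clock
    t1 = time.perf_counter()

print('inference: {:.1f} ms per 640x640 image at batch-size 1'.format((t1 - t0) / 100 * 1000))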

Expected behavior

The speed with CUDA 11.0 should be no slower than with 10.2.

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually):

Collecting environment information...
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V
GPU 2: TITAN V
GPU 3: TITAN V
GPU 4: TITAN V
GPU 5: TITAN V

Nvidia driver version: 450.80.02
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.7.0
[pip3] torchvision==0.8.1
[conda] blas                      1.0                         mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] cudatoolkit               11.0.221             h6bb024c_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] libblas                   3.8.0                    20_mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] libcblas                  3.8.0                    20_mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] liblapack                 3.8.0                    20_mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] liblapacke                3.8.0                    20_mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] mkl                       2020.2                      256    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl-service               2.3.0            py38he904b0f_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_fft                   1.2.0            py38h23d657b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_random                1.1.1            py38h0573a6f_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy                     1.19.1           py38hbc911f0_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy-base                1.19.1           py38hfa32c7d_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] pytorch                   1.7.0           py3.8_cuda11.0.221_cudnn8.0.3_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchvision               0.8.1                py38_cu110    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch

Additional context

cc @ngimel @VitalyFedyunin

I can confirm this is the case as well. Recently had a 2x speedup downgrading from CUDA 11 to CUDA 10.2 on a GTX 1080 Ti.

Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Homebrew GCC 5.5.0_7) 5.5.0
Clang version: 11.0.0
CMake version: version 3.18.4

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.89 (11.1 installed by Pop!_OS)
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 455.38
cuDNN version: 7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-lightning==1.0.5
[pip3] torch==1.6.0
[pip3] torchvision==0.7.0
[conda] blas                      2.20                        mkl    conda-forge
[conda] cudatoolkit               10.2.89              hfd86e86_1    anaconda
[conda] libblas                   3.8.0                    20_mkl    conda-forge
[conda] libcblas                  3.8.0                    20_mkl    conda-forge
[conda] liblapack                 3.8.0                    20_mkl    conda-forge
[conda] liblapacke                3.8.0                    20_mkl    conda-forge
[conda] mkl                       2020.2                      256    conda-forge
[conda] numpy                     1.19.4           py38hf0fd68c_0    conda-forge
[conda] pytorch                   1.6.0           py3.8_cuda10.2.89_cudnn7.6.5_0    pytorch
[conda] pytorch-lightning         1.0.5              pyhd8ed1ab_0    conda-forge
[conda] torchvision               0.7.0                py38_cu102    pytorch

This is the output from within the conda environment in which I experienced the speedup.

@ionlights @klyjm do you know if this is still the case with pytorch 1.7.1 ?

commented

@ionlights @klyjm do you know if this is still the case with pytorch 1.7.1 ?

@LukeAI It is not consistent: sometimes the speeds are the same, sometimes 11.0 is slower. So I still use 10.2.

@ionlights @klyjm do you know if this is still the case with pytorch 1.7.1 ?

It is for me (about 20% slower on CUDA 11.0 compared to CUDA 10.1).

Here are my first logs (CUDA 10.1):

Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
Clang version: Could not collect
CMake version: version 2.8.12.2

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti

Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchvision==0.8.2
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.1.243             h036e899_6    conda-forge
[conda] mkl                       2020.4             h726a3e6_304    conda-forge
[conda] mkl-service               2.3.0            py38h1e0a361_2    conda-forge
[conda] mkl_fft                   1.2.0            py38hab2c0dc_1    conda-forge
[conda] mkl_random                1.2.0            py38hc5bc63f_1    conda-forge
[conda] numpy                     1.19.2           py38h54aff64_0
[conda] numpy-base                1.19.2           py38hfa32c7d_0
[conda] pytorch                   1.7.1           py3.8_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] torchaudio                0.7.2                      py38    pytorch
[conda] torchvision               0.8.2                py38_cu101    pytorch

Here are my second logs (CUDA 11.0):

Collecting environment information...
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
Clang version: Could not collect
CMake version: version 2.8.12.2

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti

Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchvision==0.8.2
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.0.3               h15472ef_6    conda-forge
[conda] mkl                       2020.4             h726a3e6_304    conda-forge
[conda] mkl-service               2.3.0            py38h1e0a361_2    conda-forge
[conda] mkl_fft                   1.2.0            py38hab2c0dc_1    conda-forge
[conda] mkl_random                1.2.0            py38hc5bc63f_1    conda-forge
[conda] numpy                     1.19.2           py38h54aff64_0
[conda] numpy-base                1.19.2           py38hfa32c7d_0
[conda] pytorch                   1.7.1           py3.8_cuda11.0.221_cudnn8.0.5_0    pytorch
[conda] torchaudio                0.7.2                      py38    pytorch
[conda] torchvision               0.8.2                py38_cu110    pytorch
commented

@ionlights @klyjm do you know if this is still the case with pytorch 1.7.1 ?

It is for me (about 20% slower on CUDA 11.0 compared to CUDA 10.1).


@feinsteinben Yeah, I have just tested my code: 1.7.1 with CUDA 11.0 is about 25% slower than 10.2. That is faster than 1.7.0, but still slower.

Hi, I'm also facing the same issue (tried on A100 GPUs, which I think require CUDA >= 11). Was anybody able to overcome this issue? Thanks

Hi, I'm also facing the same issue on a 1080 when debugging a LibTorch segmentation model. The pytorch branch is v1.7.1.

CUDA 10.1 + cuDNN 7: 187 ms
CUDA 11.1 + cuDNN 8: 320 ms
commented

Hi, I'm also facing the same issue on a 1080 when debugging a LibTorch segmentation model. The pytorch branch is v1.7.1.

CUDA 10.1 + cuDNN 7: 187 ms
CUDA 11.1 + cuDNN 8: 320 ms

The speed really is still slower when using CUDA 11; I don't know what causes it.

Hi, I'm also facing the same issue on a 1080 when debugging a LibTorch segmentation model. The pytorch branch is v1.7.1.
CUDA 10.1 + cuDNN 7: 187 ms
CUDA 11.1 + cuDNN 8: 320 ms

The speed really is still slower when using CUDA 11; I don't know what causes it.

Have you tried building your own PyTorch with CUDA 11.1 (CUDA 11.2 is released but has no cuDNN support yet), or using a nightly PyTorch build?
My colleague duplicated my comparison on an RTX 2080 Ti, but no difference was observed. I guess this is related to the device architecture.

I am also getting a 2x slowdown with cuda 11 vs 10.2 on pytorch 1.7.1 on a GTX1080Ti.

Same problem on Ubuntu 18.04 using a TITAN RTX. Almost a 2x speedup at batch size 5 when using conda and downgrading from
pytorch-1.7.1-py3.7_cuda11.0.221_cudnn8.0.5_0
to
pytorch-1.7.0-py3.7_cuda10.2.89_cudnn7.6.5_0.
Driver version is 460. Should I perhaps downgrade to 455? The cuDNN support matrix confuses me a bit.

edit: tested the 1.8 nightly, which came with cuda 11.0 and cudnn 8.0.3, and did not encounter speed issues.

Is it happening also with Cuda 11.2 (supported by cudnn 8.1.0 since January 26th)?

commented

Is it happening also with Cuda 11.2 (supported by cudnn 8.1.0 since January 26th)?

This is not yet supported by 1.8.0, so I haven't tried it.

commented

Same problem on Ubuntu 18.04 using a TITAN RTX. Almost a 2x speedup at batch size 5 when using conda and downgrading from
pytorch-1.7.1-py3.7_cuda11.0.221_cudnn8.0.5_0
to
pytorch-1.7.0-py3.7_cuda10.2.89_cudnn7.6.5_0.
Driver version is 460. Should I perhaps downgrade to 455? The cuDNN support matrix confuses me a bit.

edit: tested the 1.8 nightly, which came with cuda 11.0 and cudnn 8.0.3, and did not encounter speed issues.

I just tried 1.8 with 11.1; it is still about 10%~15% slower.

1.8 with 11.1 is about 40%~45% slower than 1.8 with 10.2.

If some of the benchmarks mentioned above are public, can someone post a concrete example?
I've run the simple builtin https://github.com/pytorch/pytorch/blob/master/benchmarks/fastrnns/bench.py and the results look pretty similar across the 10.2 and 11.1 toolkits on a single RTX 2080 GPU.
With CUDA-11.1:

Namespace(cnns=None, cuda_pointwise_block_count=None, cuda_pointwise_block_size=None, cuda_pointwise_loop_level=None, device='cuda', executor=None, fuser='te', group=['cnns', 'rnns'], hiddenSize=512, inputSize=512, miniBatch=64, nloops=100, numLayers=1, print_json=None, rnns=None, sep=' ', seqLength=100, variable_lstms=False, warmup=10)
Benchmarking LSTMs...
            name          avg_fwd          std_fwd         info_fwd          avg_bwd          std_bwd         info_bwd
           cudnn            7.056           0.1674             None            10.51          0.05994             None
            aten              9.4           0.1954             None            22.74            2.145             None
             jit            7.753          0.04015             None            22.83            2.266             None
      jit_premul            6.892          0.02701             None            21.78            1.189             None
 jit_premul_bias            7.118          0.03285             None            21.68           0.9731             None
      jit_simple            7.661          0.07742             None            22.97            1.975             None
  jit_multilayer            7.771           0.0539             None             23.4            2.204             None
              py            15.72           0.2226             None            27.62            3.523             None

Benchmarking ResNets...
            name          avg_fwd          std_fwd         info_fwd          avg_bwd          std_bwd         info_bwd
        resnet18            15.88          0.02621             None            33.65          0.06821             None
    resnet18_jit            15.93          0.03301             None            33.75            0.092             None
        resnet50            52.15          0.09712             None            110.7             0.31             None
    resnet50_jit            52.24          0.09196             None            111.2           0.2458             None

With CUDA-10.2:

Namespace(cnns=None, cuda_pointwise_block_count=None, cuda_pointwise_block_size=None, cuda_pointwise_loop_level=None, device='cuda', executor=None, fuser='te', group=['cnns', 'rnns'], hiddenSize=512, inputSize=512, miniBatch=64, nloops=100, numLayers=1, print_json=None, rnns=None, sep=' ', seqLength=100, variable_lstms=False, warmup=10)
Benchmarking LSTMs...
            name          avg_fwd          std_fwd         info_fwd          avg_bwd          std_bwd         info_bwd
           cudnn            6.972          0.08801             None            10.11          0.05425             None
            aten            8.679           0.3157             None             22.1            2.322             None
             jit            7.735           0.1454             None            23.48             3.46             None
      jit_premul            6.851          0.01975             None            21.04             1.09             None
 jit_premul_bias            7.082           0.0167             None            21.21            1.128             None
      jit_simple            7.708          0.03711             None            20.87            1.761             None
  jit_multilayer            7.813          0.04202             None            21.66              2.1             None
              py            13.14           0.4173             None            25.95            2.197             None

Benchmarking ResNets...
            name          avg_fwd          std_fwd         info_fwd          avg_bwd          std_bwd         info_bwd
        resnet18            16.96          0.03068             None            34.88          0.08062             None
    resnet18_jit            16.99          0.02914             None            34.96          0.07073             None
        resnet50            54.41          0.05765             None            113.8           0.1582             None
    resnet50_jit            54.63           0.1907             None            114.6           0.7369             None

@malfet Based on the reports (self included), it seems like NVIDIA GPUs which lack tensor cores are affected (or maybe it's just the 10-series).

I should have some time today to run those benchmarks, though. Did you use the default arguments?

@jmuchovej yes, just run it as python -m fastrnns.bench

commented

@jmuchovej Not yet. I have tried on a TITAN V, which has tensor cores, and it is still slow. Some people using a 2080 Ti
also face the same issue. The 10-series cards are slower than the 20-series and the TITAN V, but all are still slow. I think this may depend on the task; I have only been using yolov5 recently, so that is all I have tried. I have done a little investigation, and it seems like more time is spent on conv; maybe the chosen conv algorithm is different?

commented

I don't keep my previous builds, so I don't have comparable benchmark results, but the situation for me with a RTX 2060 was like this:

I saw a huge performance boost, especially for mixed-precision training, going from pytorch 1.6.0-cuda 10.2 to pytorch 1.7.1-cuda 11.0. Normal training was the same. From pytorch 1.7.1-cuda 11.0 to pytorch 1.8.0-cuda 11.1, I've lost around 15-20% for both mixed and normal training.

These results are very similar on both Windows and Ubuntu.

@malfet I can't reproduce the slowdown with your benchmark. I am not sure why it doesn't show up.

But on my own repo I still see a 40% slowdown with pytorch 1.8 and cudatoolkit 11.1. It's mostly just a resnet with a double backward pass.

One epoch takes ~1:20 with pytorch 1.8 and cudatoolkit 10.2, and ~1:50 with cudatoolkit 11.1.

This was tested on a GTX1080Ti. Everything installed through conda as described on pytorch.org.
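
(As a rough, self-contained proxy for this kind of workload rather than the actual DUQ training loop, one can time a convolution-heavy forward pass followed by a double backward, e.g. a gradient-penalty-style term; the ResNet-18, input sizes, and iteration counts below are arbitrary assumptions.)

import time
import torch
import torchvision

model = torchvision.models.resnet18().cuda()
x = torch.randn(32, 3, 32, 32, device='cuda', requires_grad=True)

def step():
    out = model(x)
    # Keep the first backward in the graph so it can be differentiated again.
    grad_x, = torch.autograd.grad(out.sum(), x, create_graph=True)
    # Gradient-penalty-style term; its backward() triggers the double backward.
    grad_x.pow(2).sum().backward()

for _ in range(5):             # warm-up
    step()
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(20):
    step()
torch.cuda.synchronize()
print('{:.1f} ms per double-backward step'.format((time.perf_counter() - t0) / 20 * 1000))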

@y0ast thank you for the link to repro, will look into this one

Same problem on 10 series GPU. Pytorch 1.8.1 with py3.9_cuda11.1_cudnn8_0 is around 30-40% slower than Pytorch 1.6.0 with py3.7_cuda102_cudnn7_0.

Same problem on 940M GPU

@y0ast there's a perf problem with double backward in cuda11/cudnn8 that we worked around in #54840, can you try cuda 11 nightlies and see if your perf is recovered?

Unfortunately this does not seem to be fixed:

PyTorch version + cudatoolkit version
1.8.1 + 10.2: 1m45s per epoch
1.8.1 + 11.1: 2m20s
1.9 + 11.1: 2m20s
1.9 + 10.2: 1m45s

All obtained on a GTX 1080Ti, using this repo https://github.com/y0ast/deterministic-uncertainty-quantification (numbers are a bit different from my last comment due to counting an epoch differently, but the relative difference remains).

PyTorch 1.9 for cudatoolkit 10.2 is 705MB, while it's 1.44GB for cudatoolkit 11.1. That seems like an unusually big difference.

@y0ast what is the command line to run this benchmark?

@ngimel On the master branch, just:

 python train_duq_cifar.py --final_model

It'll download cifar10 on the fly and print a tqdm progress bar with epoch timing.

I created the environments for pytorch=1.9 like this:

conda install pytorch torchvision cudatoolkit=11.1 -c pytorch -c nvidia
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch -c nvidia

@klyjm I met a similar problem on our new server with 8 x A100 GPUs. The difference is that my model runs slower only in DDP mode and is normal in DP mode. After debugging, I found that the slow operation is loss.backward().

The environment I use is:
Collecting environment information...
PyTorch version: 1.7.0+cu110
Is debug build: True
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: A100-PCIE-40GB
GPU 1: A100-PCIE-40GB
GPU 2: A100-PCIE-40GB
GPU 3: A100-PCIE-40GB
GPU 4: A100-PCIE-40GB
GPU 5: A100-PCIE-40GB
GPU 6: A100-PCIE-40GB
GPU 7: A100-PCIE-40GB

Nvidia driver version: 450.119.04
cuDNN version: Probably one of the following:
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.7.0+cu110
[pip3] torchaudio==0.7.0
[pip3] torchfile==0.1.0
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.8.1+cu110
[conda] numpy 1.19.5 pypi_0 pypi
[conda] torch 1.7.0+cu110 pypi_0 pypi
[conda] torchaudio 0.7.0 pypi_0 pypi
[conda] torchfile 0.1.0 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.8.1+cu110 pypi_0 pypi
It is about 3 times slower than a Tesla V100 GPU (CUDA 10.2, pytorch 1.6) when I run the same code.
A100 training takes 6.01 s,
V100 training takes 1.89 s.

According to NVIDIA's official documentation, the TFLOPs for A100 and V100 are 19 and 15 respectively, which means the A100 should run faster than the V100, so I am really confused about the result.
Since CUDA 10.x is not supported by the A100 with its Ampere architecture, I didn't test the A100 with CUDA 10.x.
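
(A minimal, hedged way to time the DDP code path in isolation is a single-process, single-GPU process group; the ResNet-50, batch size, and addresses below are placeholder assumptions, not the setup described above.)

import os
import time
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('nccl', rank=0, world_size=1)   # exercise DDP on a single GPU
torch.cuda.set_device(0)

model = DDP(torchvision.models.resnet50().cuda(), device_ids=[0])
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
x = torch.randn(32, 3, 224, 224, device='cuda')
target = torch.randint(0, 1000, (32,), device='cuda')

for _ in range(5):             # warm-up
    opt.zero_grad(); loss_fn(model(x), target).backward(); opt.step()
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(20):
    opt.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()            # the step reported as slow on A100 + CUDA 11
    opt.step()
torch.cuda.synchronize()
print('{:.3f} s per DDP step'.format((time.perf_counter() - t0) / 20))
dist.destroy_process_group()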

commented

@klyjm I met a similar problem on our new server with 8 x A100 GPUs; A100 training takes 6.01 s while a Tesla V100 (CUDA 10.2, pytorch 1.6) takes 1.89 s on the same code.

@graycrown Yeah, and I have tested PyTorch 1.9; there is no difference. This bug is still unsolved.

It seems to me this is related to cuDNN.
I'm using pytorch 1.8.1 py3.8_cuda11.1_cudnn8.0.8_0 managed by conda. I tried to benchmark training performance on a GTX 1080 Ti, an RTX 2080, and an RTX 3090.
With torch.backends.cudnn.enabled=False, training performance went up by around 30%. The result is pretty consistent across all three GPUs.
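
(A minimal sketch for checking the cudnn on/off difference on a given machine; the ResNet-50, input shape, and iteration counts are placeholder assumptions rather than the training setup described above.)

import time
import torch
import torchvision

def time_train_step(use_cudnn, iters=50):
    torch.backends.cudnn.enabled = use_cudnn
    model = torchvision.models.resnet50().cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    x = torch.randn(16, 3, 224, 224, device='cuda')
    target = torch.randint(0, 1000, (16,), device='cuda')

    for _ in range(5):         # warm-up
        opt.zero_grad(); loss_fn(model(x), target).backward(); opt.step()
    torch.cuda.synchronize()

    t0 = time.perf_counter()
    for _ in range(iters):
        opt.zero_grad(); loss_fn(model(x), target).backward(); opt.step()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print('cudnn enabled :', time_train_step(True))
print('cudnn disabled:', time_train_step(False))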

@maxwxzheng can you try updating to PyTorch 1.9 and compare the performance? (Several linking issues that could have negatively affected cuDNN performance were fixed in 1.9.)

commented

@maxwxzheng can you try updating to PyTorch 1.9 and compare the performance? (Several linking issues that could have negatively affected cuDNN performance were fixed in 1.9.)

@malfet I have tested my code with PyTorch 1.9; there is no difference. PyTorch with CUDA 11.1 + cuDNN 8.x is still slower than with CUDA 10.2 + cuDNN 7.x.

I tested with pytorch 1.9 on an RTX 3090. Still no difference: training with cudnn on is about 30 to 40% slower than with cudnn off.

@klyjm, @maxwxzheng what benchmark are you running?
I cannot reproduce the perf difference on an RTX 2080 by running python train_duq_cifar.py --final_model; it takes 2m4s per iteration regardless of whether the CUDA-10.2 or CUDA-11.1 PyTorch binary is used.

commented

@malfet As shown at the beginning, I use test.py in ultralytics/yolov5 to test the speed. After upgrading to CUDA 11 and cuDNN 8, the speed is always slower than with CUDA 10 and cuDNN 7.

@klyjm ok, I can observe perf degradation between CUDA-11 and CUDA-10.2 at batch size 1, but if the batch size is larger, the trend is reversed:

CUDA ver BatchSize Inference time
10.2 1 4.5ms
11.1 1 9.2ms
10.2 64 4.3ms
11.1 64 3.9ms
commented

@malfet Yeah, you are right. The bigger the batch size, the smaller the difference. I also find that setting torch.backends.cudnn.enabled=False really accelerates the code with CUDA 11.1, but makes it slower with CUDA 10.2.

I did some benchmarks across different pytorch versions.
My network architecture is EfficientNet-B0 + FPN. I used this library: https://github.com/qubvel/segmentation_models.pytorch
It seems training speed with cudnn on has been decreasing for this architecture. (Mem is peak memory usage. Time is total time spent on training.)
(Benchmark results attached as a screenshot: peak memory and total training time per pytorch version.)

cc @ptrblck, can you reproduce these results?

I couldn't find any examples of how to use the posted repository, but since it seems to reuse the timm repo, I just used efficientnet_b0 from it directly and couldn't reproduce the issue:

source build, cudnn8.2.2
cudnn.enabled=False, cudnn.benchmark=False, 20.43099s
cudnn.enabled=True, cudnn.benchmark=False, 17.86255s
cudnn.enabled=True, cudnn.benchmark=True, 17.97606s


1.9.0+cu111, cudnn8.0.5
cudnn.enabled=False, cudnn.benchmark=False, 21.61472s
cudnn.enabled=True, cudnn.benchmark=False, 19.17168s
cudnn.enabled=True, cudnn.benchmark=True, 19.05530s

1.8.1+cu111, cudnn8.0.5
cudnn.enabled=False, cudnn.benchmark=False, 21.60732s
cudnn.enabled=True, cudnn.benchmark=False, 19.55659s
cudnn.enabled=True, cudnn.benchmark=True, 19.03868s

1.7.1+cu110, cudnn8.0.5
cudnn.enabled=False, cudnn.benchmark=False, 45.85837s
cudnn.enabled=True, cudnn.benchmark=False, 19.83326s
cudnn.enabled=True, cudnn.benchmark=True, 19.36150s

Code:

import torch
import torch.nn as nn
import time
import timm
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True

model = timm.create_model('efficientnet_b0').cuda()
x = torch.randn(12, 3, 960, 640, device='cuda')

# warmup
for _ in range(10):
    out = model(x)
    out.backward(torch.ones_like(out))
    
grad = torch.ones_like(out)
nb_iters = 100
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    out = model(x)
    out.backward(grad)
torch.cuda.synchronize()
t1 = time.perf_counter()

print('cudnn.enabled={}, cudnn.benchmark={}, {:.5f}s'.format(
    torch.backends.cudnn.enabled, torch.backends.cudnn.benchmark, (t1 - t0)))

Thanks @ngimel and @ptrblck for looking into this.
I think maybe one of two things is happening:

  1. I'm doing something stupid in my code.
    I tried to profile the training with torch.profiler (see the sketch after this list). In the attached screenshot of the profiler output, aten::cudnn_convolution_backward_input is taking a long time. Is this common? What would cause this step to take such a long time?
  2. Somehow when I install pytorch 1.8+, conda automatically downgrades torchvision to 0.2.2. Would this older version of torchvision cause the slowdown?
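
(For reference, a minimal torch.profiler sketch that surfaces per-operator CUDA time, which is where ops such as aten::cudnn_convolution_backward_input show up; the ResNet-18 and input shape are placeholder assumptions, not the network discussed above.)

import torch
import torchvision
from torch.profiler import profile, ProfilerActivity

model = torchvision.models.resnet18().cuda()
x = torch.randn(8, 3, 224, 224, device='cuda')

for _ in range(3):             # warm-up so cuDNN algorithm selection does not dominate the profile
    model(x).sum().backward()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x).sum().backward()

print(prof.key_averages().table(sort_by='cuda_time_total', row_limit=10))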

@ptrblck hello, I see the same issue on an Ampere RTX 3090. Tensor loading is 2x slower with CUDA 11.3 + libtorch cu113 (~6 sec) versus CUDA 10.2 + libtorch cu102 (~3 sec).
The problem is still there; is there a solution?

@AlexTitovWork could you describe a bit more what "speed in tensor loading" means?
Are you measuring the time to transfer the tensor from CPU to the GPU and are seeing a slowdown?

Hello @ptrblck! I use a simple test that uploads data into a GPU tensor inside a Docker container.

     int height = 400;  // color image size in px
     int width = 400;
     // Pinned staging tensor (page-locked host memory speeds up H2D copies)
     torch::Tensor tensor_image = torch::zeros({1, height, width, 3});
     tensor_image = tensor_image.pin_memory();
     // Wrap the image data, convert NHWC -> NCHW, normalize, and move to the GPU
     tensor_image = torch::from_blob(input.data, {1, input.rows, input.cols, 3}, torch::kByte);
     tensor_image = tensor_image.permute({0, 3, 1, 2});
     tensor_image = tensor_image.toType(torch::kFloat);
     tensor_image = tensor_image.div(255);
     tensor_image = tensor_image.to(torch::kCUDA);

I use the same test on two GPU platforms, measuring allocation + data transfer:

  1. On an RTX 2080 Ti:
    CUDA driver version / runtime version: 11.2 / 10.2
    CUDA capability major/minor version number: 7.5
    Takes ~3 sec.
  2. On an RTX 3090:
    CUDA driver version / runtime version: 11.4 / 11.3
    CUDA capability major/minor version number: 8.6
    Takes ~6 sec.

This is not a lot of code, yet in the first case it takes ~3 seconds to allocate memory and in the second a terrible ~6.
Moreover, most of the time is taken by device initialization and memory allocation.

It seems to be a problem with the docker I'm using.

Yes, loading the model parameters and memory allocation are very slow with CUDA 11.5 and torch 1.3.1. I am testing on CUDA 10.1 and hope it will work.

Hello! I found the following information about loading and memory allocation on the first start of LibTorch or PyTorch,
from https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#troubleshooting:
Q: Why is cuDNN version 8.0 convolution API call much slower on the first call than subsequent calls?

A: Due to the library split, cuDNN version 8.0 API will only load the necessary kernels on the first API call that requires it. In previous versions, this load would have been observed in the first cuDNN API call that triggers CUDA context initialization, typically cudnnCreate(). In version 8.0, this is delayed until the first sub-library call that triggers CUDA context initialization. Users who desire to have CUDA context preloaded can call the new cudnnCnnInferVersionCheck() API (or its related cousins), which has the side effect of initializing a CUDA context. This will reduce the run time for all subsequent API calls.
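
(In PyTorch terms, a hedged way to separate that one-time loading cost from the steady-state cost is to time the first cuDNN convolution call separately from later calls; the layer and input sizes below are arbitrary assumptions.)

import time
import torch

conv = torch.nn.Conv2d(3, 64, 3, padding=1).cuda()   # .cuda() already creates the CUDA context
x = torch.randn(1, 3, 400, 400, device='cuda')

torch.cuda.synchronize()
t0 = time.perf_counter()
conv(x)                        # first cuDNN call: pays cuDNN 8's lazy kernel loading
torch.cuda.synchronize()
print('first conv call: {:.3f} s'.format(time.perf_counter() - t0))

t0 = time.perf_counter()
for _ in range(100):
    conv(x)
torch.cuda.synchronize()
print('avg later call : {:.6f} s'.format((time.perf_counter() - t0) / 100))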

Same observation for torch 1.8.2 on a 2080 Ti machine. The overall speed of one of my training jobs is about 20% slower with cu11 than with cu10.2.

I also still observe this with pytorch 1.11.0

With https://github.com/y0ast/DUE

python train_due.py

~2 minutes per epoch on cudatoolkit 10.2, ~4 minutes per epoch on cudatoolkit 11.3

Reproduced on two different machines with a 1080Ti (driver 510.47) and Titan Xp (driver 510.68).

I've gone over @ptrblck's example to see where the difference comes from, and just adding:

torch.backends.cudnn.benchmark = True

makes my epoch go from 4:55 to 1:57 with newer CUDA/cuDNN versions on the codebase I linked above. This is the same as it was with CUDA 10.2 (and cuDNN 7+).

My hypothesis is that in CuDNN 8+ the default convolution algorithm changed. This change is probably fine for newer hardware, but runs badly on older hardware. By setting benchmark to true, CuDNN is forced to re-evaluate that choice and finds that the old choice is better.
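
(For anyone trying this, a minimal sketch of where the flag goes; benchmark mode re-tunes the convolution algorithm per input shape, so it helps most when shapes stay fixed. The ResNet-18 and input shape are placeholder assumptions.)

import torch
import torchvision

# Enable cuDNN autotuning before the first forward pass.
torch.backends.cudnn.benchmark = True

model = torchvision.models.resnet18().cuda()
x = torch.randn(64, 3, 32, 32, device='cuda')   # fixed input shape, so autotuning pays off

for _ in range(10):
    model(x).sum().backward()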

Same here; @y0ast's suggestion didn't change anything for me.

Same here. pytorch==1.10.1+cu111 is 4 times faster than pytorch==1.10.1+cu113 on an A100 machine with cudatoolkit=11.4.

"Same here" is unfortunately not actionable.
@namespace-Pt this issue also discussed the difference between CUDA10.2 vs. 11.0, while you are using 11.x releases, so please feel free to create a new issues providing a minimal, executable code snippet as well as your system information as asked in the bug template.