
Description

We used pybench to run benchmarks on AGX Orin
and found that CuPy runs slower than NumPy in the SVD and matrix multiplication use cases.

According to the tegrastats log, Orin's GPU had already reached 99% utilization during the benchmark.
Could you help check whether this is expected? Why does matrix multiplication run slower on the GPU than on the CPU?

We also tested with the CuPy built-in profiler and got similar performance results.
According to the nsys profiler, the bottleneck appears to be the cutlass_80_tensorop_d884gemm_64x32_16x4_nn_align1 kernel.

Env
System: AGX Orin
Main memory: 64GB
Python: 3.8.10
NumPy: 1.22.0
CuPy: 12.3.0
CUDA Toolkit: 11.4 (JetPack 5.1.2)

pybench

-------------------------------------------------------------------------------------------- benchmark: 2 tests -------------------------------------------------------------------------------------------
Name (time in s)                                 Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_Matrix_Multiplication[shape0-numpy]     10.6928 (1.0)      10.7859 (1.0)      10.7254 (1.0)      0.0376 (3.39)     10.7174 (1.0)      0.0508 (5.99)          1;0  0.0932 (1.0)           5           1
test_Matrix_Multiplication[shape0-cupy]      28.2100 (2.64)     28.2370 (2.62)     28.2175 (2.63)     0.0111 (1.0)      28.2127 (2.63)     0.0085 (1.0)           1;1  0.0354 (0.38)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

cupyx.profiler

>>> import cupy as cp
>>> from cupyx.profiler import benchmark
>>> compute_func = lambda data: data.dot(data)
>>> a = cp.random.random((10000, 10000))
>>> print(benchmark(compute_func, (a,), n_repeat=20))
<lambda>            :    CPU:   202.340 us   +/- 12.619 (min:   178.529 / max:   228.897) us     GPU-0: 28209409.863 us   +/- 3048.091 (min: 28202355.469 / max: 28213451.172) us

Nsys

[4/7] Executing 'cuda_api_sum' stats report
 
Time (%)  Total Time (ns)  Num Calls      Avg (ns)          Med (ns)      Min (ns)     Max (ns)       StdDev (ns)                 Name
--------  ---------------  ---------  ----------------  ----------------  --------  --------------  ----------------  ----------------------------
     99.3  169,242,487,328         12  14,103,540,610.7  14,101,246,144.0     2,080  28,211,460,672  14,728,950,471.4  cudaDeviceSynchronize
      0.6    1,057,134,336         10     105,713,433.6         136,240.0       576     687,304,352     218,899,008.4  cudaFree
      0.1      125,990,272          7      17,998,610.3         451,232.0     5,344      78,229,984      28,754,464.4  cudaMalloc
      0.0          592,256          1         592,256.0         592,256.0   592,256         592,256               0.0  cuModuleLoadData
      0.0          439,296          8          54,912.0          57,824.0    18,016          69,248          16,960.2  cudaLaunchKernel
      0.0          368,352      1,122             328.3             256.0       128          10,048             367.0  cuGetProcAddress
      0.0          212,096          1         212,096.0         212,096.0   212,096         212,096               0.0  cuModuleUnload
      0.0           47,232          7           6,747.4           6,848.0     2,720           9,632           2,400.6  cudaStreamIsCapturing_v10000
      0.0           44,192          1          44,192.0          44,192.0    44,192          44,192               0.0  cuLaunchKernel
      0.0           38,336         18           2,129.8             896.0       608          21,312           4,796.0  cudaEventDestroy
      0.0           30,944          1          30,944.0          30,944.0    30,944          30,944               0.0  cudaMemGetInfo
      0.0           26,848         18           1,491.6             672.0       640          12,416           2,760.3  cudaEventCreateWithFlags
      0.0            9,536          3           3,178.7           3,456.0     2,016           4,064           1,051.8  cuInit
 
[5/7] Executing 'cuda_gpu_kern_sum' stats report
 
Time (%)  Total Time (ns)  Instances      Avg (ns)          Med (ns)         Min (ns)        Max (ns)     StdDev (ns)                                                  Name                
 --------  ---------------  ---------  ----------------  ----------------  --------------  --------------  -----------  ----------------------------------------------------------------------------------------------------
    100.0  169,230,052,160          6  28,205,008,693.3  28,208,045,280.0  28,192,812,480  28,211,028,480  6,699,818.0  void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x32_16x4_nn_align1>(T1::Params)
      0.0        8,925,568          1       8,925,568.0       8,925,568.0       8,925,568       8,925,568          0.0  cupy_random_x_mod_1                                                 
      0.0        5,705,792          1       5,705,792.0       5,705,792.0       5,705,792       5,705,792          0.0  void gen_sequenced<curandStateXORWOW, double, int, &curand_uniform_double_noargs<curandStateXORWOW>…
      0.0          225,152          1         225,152.0         225,152.0         225,152         225,152          0.0  void generate_seed_pseudo<rng_config<curandStateXORWOW, (curandOrdering)101>>(unsigned long long, u…

To Reproduce

Set up AGX Orin with JetPack 5.1.2

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

$ pip3 install numba
$ pip3 install cupy
$ pip3 install numpy==1.22

$ git clone https://github.com/pentschev/pybench.git
$ cd pybench/
$ pip install git+https://github.com/pentschev/pybench
$ rm -rf pybench/benchmarks/__pycache__/
$ pytest pybench/benchmarks/benchmark_array.py --benchmark-json=benchmark_result.json

It seems it's dot() being benchmarked:
https://github.com/pentschev/pybench/blob/89d65a6c418a1fee39d447bd11b8a999835b74a9/pybench/benchmarks/benchmark_array.py#L48

Internally, dot() calls cublasGemmEx():

cupy/cupy/_core/_routines_linalg.pyx

Line 763 in e2d0b98

cublas.gemmEx(

and the CUTLASS kernel that appeared in your profiling is the implementation that cuBLAS dispatched to, so there is nothing that CuPy can influence.

@AastaLLL The reason for the seemingly suboptimal performance in your benchmarks is that fp64 was used. The Ampere chip in Jetson AGX Orin has much worse performance for fp64 than for fp32. Could you run the benchmarks with fp32 and see what you get?
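
For anyone who wants to check both precisions in one run, here is a minimal sketch built on cupyx.profiler.benchmark. The helper name compare_precisions and the small default size are illustrative assumptions, not part of CuPy or pybench. The helper returns None when CuPy is not installed.

```python
try:
    import cupy as cp
    from cupyx.profiler import benchmark
except ImportError:  # CuPy is not installed on this machine
    cp = None


def compare_precisions(n=2000, n_repeat=5):
    """Mean GPU time (seconds) of data.dot(data) per dtype; None without CuPy.

    Hypothetical helper for illustration only.
    """
    if cp is None:
        return None
    results = {}
    for dtype in (cp.float64, cp.float32):
        # cp.random.random returns float64 by default, so cast explicitly.
        a = cp.random.random((n, n)).astype(dtype)
        perf = benchmark(lambda data: data.dot(data), (a,), n_repeat=n_repeat)
        # gpu_times has shape (n_devices, n_repeat); values are in seconds.
        results[a.dtype.name] = float(perf.gpu_times.mean())
    return results


if __name__ == "__main__":
    print(compare_precisions())
```

Because cp.random.random defaults to float64, a benchmark that never sets dtype exercises the fp64 path, which is the slow one on this hardware.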

@leofang

Using fp32, the benchmark result looks much better:

Thanks.

Thanks, @AastaLLL. One last question before closing: Could you run the same snippet that you had, but use fp32 this time?

>>> compute_func = lambda data: data.dot(data)
>>> a = cp.random.random((10000, 10000), dtype=cp.float32)
>>> print(benchmark(compute_func, (a,), n_repeat=20))

Hi, @leofang

Sure, below is the output of the CuPy profiler:

>>> compute_func = lambda data: data.dot(data)
>>> a = cp.random.random((10000, 10000), dtype=cp.float32)
>>> print(benchmark(compute_func, (a,), n_repeat=20))
<lambda>            :    CPU:    64.109 us   +/-  7.149 (min:    57.728 / max:    92.289) us     GPU-0: 516015.091 us   +/- 83.409 (min: 515840.271 / max: 516146.912) us
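
To put the two cupyx.profiler results in perspective, the mean GPU times can be converted into achieved throughput. The sketch below assumes the usual 2*n**3 floating-point operation count for a dense n x n matrix product. The function name gemm_gflops is illustrative, and the constants are the mean GPU times reported in this thread.

```python
# Mean GPU times reported by cupyx.profiler in this thread, in microseconds.
N = 10_000
FP64_GPU_US = 28_209_409.863
FP32_GPU_US = 516_015.091


def gemm_gflops(n, seconds):
    """Achieved GFLOP/s for an (n x n) @ (n x n) product, counting 2*n**3 FLOPs."""
    return 2 * n**3 / seconds / 1e9


fp64 = gemm_gflops(N, FP64_GPU_US * 1e-6)
fp32 = gemm_gflops(N, FP32_GPU_US * 1e-6)
speedup = FP64_GPU_US / FP32_GPU_US

print(f"fp64: {fp64:.1f} GFLOP/s")
print(f"fp32: {fp32:.1f} GFLOP/s")
print(f"fp32 speedup over fp64: {speedup:.1f}x")
```

Under that assumption, the fp64 run works out to roughly 71 GFLOP/s and the fp32 run to roughly 3.9 TFLOP/s, a difference of about 55x on the same matrix size.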

Here are the pybench results for reference:

------------------------------------------------------------------------------------------------------ benchmark: 2 tests -----------------------------------------------------------------------------------------------------
Name (time in us)                                   Min                   Max                  Mean              StdDev                Median                 IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_Matrix_Multiplication[shape0-cupy]        740.6450 (1.0)      1,291.3370 (1.0)        929.0434 (1.0)      263.1012 (1.0)        742.5340 (1.0)      428.6752 (1.04)          1;0  1,076.3760 (1.0)           5           1
test_Matrix_Multiplication[shape0-numpy]     5,887.4010 (7.95)     6,846.0960 (5.30)     6,133.2588 (6.60)     410.8201 (1.56)     5,906.3130 (7.95)     411.3472 (1.0)           1;0    163.0455 (0.15)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
===================================================================================== 2 passed in 1.66s =====================================================================================

Thanks.

Thanks, @AastaLLL. Let me close the issue.

(Thanks also to the cuBLAS team for helping triage.)

CuPy matrix multiplication is 2.6x slower than NumPy on Jetson AGX Orin
