CuPy matrix multiplication is 2.6x slower than NumPy on Jetson AGX Orin
AastaLLL opened this issue
Description
We used pybench to benchmark CuPy on AGX Orin and found that CuPy runs slower than NumPy in the SVD and Matrix Multiplication use cases:
According to the tegrastats log, the Orin GPU already reached 99% utilization during the benchmark.
Please help check whether this is expected. Why does matrix multiplication run slower on the GPU than on the CPU?
We also tested with the CuPy built-in profiler and got similar performance results.
According to the nsys profile, the bottleneck appears to be the cutlass_80_tensorop_d884gemm_64x32_16x4_nn_align1 kernel.
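The kernel name itself gives a hint about what is being executed. As a rough sketch, assuming the usual CUTLASS naming scheme (target architecture, data-type letter plus math-instruction shape, threadblock tile, stage count, operand layouts, alignment), the name can be decoded as below. The helper `parse_cutlass_gemm_name` and the letter-to-precision table are illustrative assumptions and are not part of CuPy or cuBLAS.

```python
import re

# Assumed mapping from the CUTLASS data-type letter to a precision name.
DTYPE_LETTERS = {"d": "float64", "s": "float32", "h": "float16"}

# Assumed layout of a CUTLASS tensor-op GEMM kernel name.
PATTERN = re.compile(
    r"cutlass_(?P<arch>\d+)_tensorop_"
    r"(?P<dtype>[a-z])(?P<inst>\d+)gemm_"
    r"(?P<tile_m>\d+)x(?P<tile_n>\d+)_(?P<tile_k>\d+)x(?P<stages>\d+)_"
    r"(?P<layout>[nt]{2})_align(?P<align>\d+)"
)


def parse_cutlass_gemm_name(name):
    """Split a CUTLASS GEMM kernel name into its (assumed) components."""
    match = PATTERN.search(name)
    if match is None:
        raise ValueError(f"unrecognized kernel name: {name!r}")
    f = match.groupdict()
    return {
        "arch": int(f["arch"]),
        "dtype": DTYPE_LETTERS.get(f["dtype"], "unknown"),
        "instruction_shape": f["inst"],
        "threadblock_tile": (int(f["tile_m"]), int(f["tile_n"]), int(f["tile_k"])),
        "stages": int(f["stages"]),
        "layout": f["layout"],
        "alignment": int(f["align"]),
    }


info = parse_cutlass_gemm_name("cutlass_80_tensorop_d884gemm_64x32_16x4_nn_align1")
print(info)
```

Under these assumptions, the leading `d` in `d884gemm` marks a double-precision (fp64) GEMM.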
Env
System: AGX Orin
Main memory: 64GB
Python: 3.8.10
NumPy: 1.22.0
CuPy: 12.3.0
CUDA Toolkit: 11.4 (JetPack 5.1.2)
pybench
-------------------------------------------------------------------------------------------- benchmark: 2 tests -------------------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_Matrix_Multiplication[shape0-numpy] 10.6928 (1.0) 10.7859 (1.0) 10.7254 (1.0) 0.0376 (3.39) 10.7174 (1.0) 0.0508 (5.99) 1;0 0.0932 (1.0) 5 1
test_Matrix_Multiplication[shape0-cupy] 28.2100 (2.64) 28.2370 (2.62) 28.2175 (2.63) 0.0111 (1.0) 28.2127 (2.63) 0.0085 (1.0) 1;1 0.0354 (0.38) 5 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cupyx.profiler
>>> import cupy as cp
>>> from cupyx.profiler import benchmark
>>> compute_func = lambda data: data.dot(data)
>>> a = cp.random.random((10000, 10000))
>>> print(benchmark(compute_func, (a,), n_repeat=20))
<lambda> : CPU: 202.340 us +/- 12.619 (min: 178.529 / max: 228.897) us GPU-0: 28209409.863 us +/- 3048.091 (min: 28202355.469 / max: 28213451.172) us
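To put the GPU time in perspective, it can be converted into an effective throughput. The sketch below assumes the conventional 2*n^3 floating-point operation count for a dense n x n matrix product; the helper name `gemm_gflops` is made up for illustration.

```python
def gemm_gflops(n, seconds):
    """Effective GFLOP/s for an n x n by n x n matrix product (2 * n**3 FLOPs)."""
    return 2 * n**3 / seconds / 1e9


# Mean GPU time reported by cupyx.profiler above, converted from microseconds.
fp64_seconds = 28209409.863e-6
fp64_gflops = gemm_gflops(10000, fp64_seconds)
print(f"{fp64_gflops:.1f} GFLOP/s")
```

This works out to roughly 71 GFLOP/s for the default float64 array.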
Nsys
[4/7] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ---------------- ---------------- -------- -------------- ---------------- ----------------------------
99.3 169,242,487,328 12 14,103,540,610.7 14,101,246,144.0 2,080 28,211,460,672 14,728,950,471.4 cudaDeviceSynchronize
0.6 1,057,134,336 10 105,713,433.6 136,240.0 576 687,304,352 218,899,008.4 cudaFree
0.1 125,990,272 7 17,998,610.3 451,232.0 5,344 78,229,984 28,754,464.4 cudaMalloc
0.0 592,256 1 592,256.0 592,256.0 592,256 592,256 0.0 cuModuleLoadData
0.0 439,296 8 54,912.0 57,824.0 18,016 69,248 16,960.2 cudaLaunchKernel
0.0 368,352 1,122 328.3 256.0 128 10,048 367.0 cuGetProcAddress
0.0 212,096 1 212,096.0 212,096.0 212,096 212,096 0.0 cuModuleUnload
0.0 47,232 7 6,747.4 6,848.0 2,720 9,632 2,400.6 cudaStreamIsCapturing_v10000
0.0 44,192 1 44,192.0 44,192.0 44,192 44,192 0.0 cuLaunchKernel
0.0 38,336 18 2,129.8 896.0 608 21,312 4,796.0 cudaEventDestroy
0.0 30,944 1 30,944.0 30,944.0 30,944 30,944 0.0 cudaMemGetInfo
0.0 26,848 18 1,491.6 672.0 640 12,416 2,760.3 cudaEventCreateWithFlags
0.0 9,536 3 3,178.7 3,456.0 2,016 4,064 1,051.8 cuInit
[5/7] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ---------------- ---------------- -------------- -------------- ----------- ----------------------------------------------------------------------------------------------------
100.0 169,230,052,160 6 28,205,008,693.3 28,208,045,280.0 28,192,812,480 28,211,028,480 6,699,818.0 void cutlass::Kernel<cutlass_80_tensorop_d884gemm_64x32_16x4_nn_align1>(T1::Params)
0.0 8,925,568 1 8,925,568.0 8,925,568.0 8,925,568 8,925,568 0.0 cupy_random_x_mod_1
0.0 5,705,792 1 5,705,792.0 5,705,792.0 5,705,792 5,705,792 0.0 void gen_sequenced<curandStateXORWOW, double, int, &curand_uniform_double_noargs<curandStateXORWOW>…
0.0 225,152 1 225,152.0 225,152.0 225,152 225,152 0.0 void generate_seed_pseudo<rng_config<curandStateXORWOW, (curandOrdering)101>>(unsigned long long, u…
To Reproduce
Set up AGX Orin with JetPack 5.1.2
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
$ pip3 install numba
$ pip3 install cupy
$ pip3 install numpy==1.22
$ git clone https://github.com/pentschev/pybench.git
$ cd pybench/
$ pip install git+https://github.com/pentschev/pybench
$ rm -rf pybench/benchmarks/__pycache__/
$ pytest pybench/benchmarks/benchmark_array.py --benchmark-json=benchmark_result.json
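Once the run finishes, the slowdown can be read from the JSON report rather than from the console table. This is a minimal sketch, assuming the pytest-benchmark report keeps each entry under `benchmarks` with a `name` and a `stats.mean` field; the inline sample stands in for the real file and only carries the two mean values from the table above.

```python
import json


def mean_times(report):
    """Map each benchmark name to its mean runtime in seconds."""
    return {b["name"]: b["stats"]["mean"] for b in report["benchmarks"]}


# Stand-in for json.load(open("benchmark_result.json")), trimmed to the
# two mean values shown in the pybench table above.
sample = json.loads("""
{
  "benchmarks": [
    {"name": "test_Matrix_Multiplication[shape0-numpy]", "stats": {"mean": 10.7254}},
    {"name": "test_Matrix_Multiplication[shape0-cupy]", "stats": {"mean": 28.2175}}
  ]
}
""")

means = mean_times(sample)
slowdown = (
    means["test_Matrix_Multiplication[shape0-cupy]"]
    / means["test_Matrix_Multiplication[shape0-numpy]"]
)
print(f"CuPy / NumPy mean time: {slowdown:.2f}x")
```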
It seems dot() is what's being benchmarked:
https://github.com/pentschev/pybench/blob/89d65a6c418a1fee39d447bd11b8a999835b74a9/pybench/benchmarks/benchmark_array.py#L48
Internally, dot() calls cublasGemmEx() (cupy/cupy/_core/_routines_linalg.pyx, line 763 in e2d0b98), and the cutlass kernel that appeared in your profile is the implementation that cuBLAS dispatched to, so there is nothing that CuPy can influence here.
@AastaLLL The seemingly suboptimal performance in your benchmarks comes from using fp64. The Ampere GPU in Jetson AGX Orin has much lower throughput for fp64 than for fp32. Could you run the benchmarks with fp32 and see what you get?
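One way to compare the two precisions side by side is sketched below. It is only an illustration: `pick_backend` and `time_dot` are hypothetical helpers, the script falls back to NumPy when no CUDA device is usable, and the matrix size is kept small so it finishes quickly (pass n=10000 to match the benchmark in this issue).

```python
import time

import numpy as np


def pick_backend():
    """Return (array module, sync function); falls back to NumPy without a GPU."""
    try:
        import cupy as cp

        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp, cp.cuda.Stream.null.synchronize
    except Exception:
        pass
    return np, lambda: None


def time_dot(xp, sync, dtype, n, repeats=3):
    """Best wall-clock time in seconds of a.dot(a) for an n x n array of dtype."""
    a = xp.random.random((n, n)).astype(dtype)
    a.dot(a)  # warm-up so one-time setup cost is not measured
    sync()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a.dot(a)
        sync()  # GPU kernels are asynchronous; wait before stopping the clock
        best = min(best, time.perf_counter() - start)
    return best


xp, sync = pick_backend()
timings = {name: time_dot(xp, sync, name, n=256) for name in ("float32", "float64")}
for name, seconds in timings.items():
    print(f"{xp.__name__} {name}: {seconds * 1e3:.3f} ms")
```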
Thanks, @AastaLLL. One last question before closing: Could you run the same snippet that you had, but use fp32 this time?
>>> compute_func = lambda data: data.dot(data)
>>> a = cp.random.random((10000, 10000), dtype=cp.float32)
>>> print(benchmark(compute_func, (a,), n_repeat=20))
Hi, @leofang
Sure, below is the output of the CuPy profiler:
>>> compute_func = lambda data: data.dot(data)
>>> a = cp.random.random((10000, 10000), dtype=cp.float32)
>>> print(benchmark(compute_func, (a,), n_repeat=20))
<lambda> : CPU: 64.109 us +/- 7.149 (min: 57.728 / max: 92.289) us GPU-0: 516015.091 us +/- 83.409 (min: 515840.271 / max: 516146.912) us
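For reference, the two cupyx.profiler GPU means in this thread can be compared directly. The snippet below is plain arithmetic on those reported numbers, again assuming the 2*n^3 operation count for a dense matrix product.

```python
n = 10000
fp64_us = 28209409.863  # mean GPU time of the float64 run, in microseconds
fp32_us = 516015.091    # mean GPU time of the float32 run, in microseconds

speedup = fp64_us / fp32_us
fp32_tflops = 2 * n**3 / (fp32_us * 1e-6) / 1e12

print(f"fp32 is {speedup:.1f}x faster than fp64")
print(f"fp32 effective throughput: {fp32_tflops:.2f} TFLOP/s")
```

So the same 10000 x 10000 product is about 55 times faster in fp32 on this device.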
Here are the pybench results for reference:
------------------------------------------------------------------------------------------------------ benchmark: 2 tests -----------------------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_Matrix_Multiplication[shape0-cupy] 740.6450 (1.0) 1,291.3370 (1.0) 929.0434 (1.0) 263.1012 (1.0) 742.5340 (1.0) 428.6752 (1.04) 1;0 1,076.3760 (1.0) 5 1
test_Matrix_Multiplication[shape0-numpy] 5,887.4010 (7.95) 6,846.0960 (5.30) 6,133.2588 (6.60) 410.8201 (1.56) 5,906.3130 (7.95) 411.3472 (1.0) 1;0 163.0455 (0.15) 5 1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
===================================================================================== 2 passed in 1.66s =====================================================================================
Thanks.