performance optimizations on CUDA 12.9?
crischeng opened this issue · comments
What performance optimizations were added in CUDA 12.9?
I ran test_fp8.py on an H20-96G with CUDA 12.8 and with CUDA 12.9, and the performance was the same.
DeepGemm commit f85ec64
CUDA 12.8:

CUDA 12.9:

NVCC in CUDA 12.9 includes an optimization that improves the overlap between tensor core and CUDA core work. This affects H800 devices (1979 peak TFLOPS) more. My guess is that because the H20 has much lower peak TFLOPS, the optimization has no visible effect there.
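To illustrate that guess, here is a rough toy model in Python, not a measurement and not how DeepGEMM or NVCC actually schedule work. It assumes each tile needs a fixed amount of CUDA core time (for example, accumulator promotion) plus tensor core time that shrinks as peak TFLOPS grows. Without overlap the two add up; with perfect overlap only the longer one counts. The function name, the per-tile FLOP count, the CUDA core time, and the 300 TFLOPS "lower peak" value are all hypothetical placeholders, not real device specs.

```python
def overlap_speedup(peak_tflops, tile_flops=1.0e9, cuda_core_seconds=0.2e-6):
    """Toy estimate of the speedup from overlapping tensor core and CUDA core work.

    All default values are made-up illustration numbers, not hardware data.
    """
    # Time the tensor cores need for one tile at the given peak rate.
    tensor_seconds = tile_flops / (peak_tflops * 1.0e12)
    # No overlap: CUDA core work runs after the tensor core work.
    serial = tensor_seconds + cuda_core_seconds
    # Perfect overlap: the shorter phase is hidden behind the longer one.
    overlapped = max(tensor_seconds, cuda_core_seconds)
    return serial / overlapped


if __name__ == "__main__":
    # 1979 is the peak figure quoted above; 300 is a hypothetical lower peak.
    for label, peak in [("high peak (1979 TFLOPS)", 1979.0),
                        ("hypothetical lower peak (300 TFLOPS)", 300.0)]:
        print(f"{label}: estimated speedup {overlap_speedup(peak):.2f}x")
```

In this model the device with the lower peak spends much longer in tensor core work per tile, so the hidden CUDA core time is a smaller fraction of the total and the gain from better overlap shrinks. A real kernel may also be limited by memory bandwidth, which this sketch ignores.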