NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Home Page: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html

When A and B are fp8 tensors, the compute type could be `CUBLAS_COMPUTE_16F`

condy0919 opened this issue

cublasComputeType_t gemm_compute_type = CUBLAS_COMPUTE_32F;
if (A_type == CUDA_R_32F && B_type == CUDA_R_32F && D_type == CUDA_R_32F) {
  gemm_compute_type = CUBLAS_COMPUTE_32F_FAST_TF32;
}
// Create matrix descriptors. Not setting any extra attributes.
NVTE_CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&Adesc, A_type,
                                             transa == CUBLAS_OP_N ? m : k,
                                             transa == CUBLAS_OP_N ? k : m,
                                             lda));
NVTE_CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&Bdesc, B_type,
                                             transb == CUBLAS_OP_N ? k : n,
                                             transb == CUBLAS_OP_N ? n : k,
                                             ldb));
NVTE_CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&Ddesc, D_type, m, n, ldd));
NVTE_CHECK_CUBLAS(cublasLtMatmulDescCreate(&operationDesc, gemm_compute_type, CUDA_R_32F));
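
For reference, the change suggested in the title would amount to something like the sketch below. This is only an illustration, not a TransformerEngine patch: it assumes the FP8 inputs arrive as the `CUDA_R_8F_E4M3` / `CUDA_R_8F_E5M2` dtypes, and it sets aside whether cuBLASLt actually accepts `CUBLAS_COMPUTE_16F` for FP8 operands.

```cuda
// Sketch only: request an FP16 accumulator when both inputs are FP8.
const bool a_is_fp8 = (A_type == CUDA_R_8F_E4M3 || A_type == CUDA_R_8F_E5M2);
const bool b_is_fp8 = (B_type == CUDA_R_8F_E4M3 || B_type == CUDA_R_8F_E5M2);

cublasComputeType_t gemm_compute_type = CUBLAS_COMPUTE_32F;
if (A_type == CUDA_R_32F && B_type == CUDA_R_32F && D_type == CUDA_R_32F) {
  gemm_compute_type = CUBLAS_COMPUTE_32F_FAST_TF32;
} else if (a_is_fp8 && b_is_fp8) {
  gemm_compute_type = CUBLAS_COMPUTE_16F;  // proposed: FP16 accumulation for FP8 GEMMs
}
```

The scale type passed to `cublasLtMatmulDescCreate` would presumably have to change from `CUDA_R_32F` to `CUDA_R_16F` as well, since cuBLASLt pairs `CUBLAS_COMPUTE_16F` with an FP16 scale type.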

The precision of FP16 is enough for the additions and multiplications of FP8 values, even though FP8-acc-FP32 has the same FLOPS as FP8-acc-FP16 on H100.

[Image: Tensor Core throughput table from the Hopper whitepaper.]

Specifying the FP16 compute type states the required precision more exactly, and the FLOPS of FP8-acc-FP16 may be boosted on future architectures.

The above image is from https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper

Please keep in mind that with Tensor Cores the accumulator stores the result of not just a single multiplication and addition of two FP8 values, but of a long series of such multiplications and additions. Each element of that sum can have a different magnitude, and the precision of FP16 could be inadequate to produce an accurate output. In internal experiments we saw convergence issues when using the lower-precision accumulator.
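
As a standalone illustration of this point (not TransformerEngine code, just a minimal CUDA sketch; it assumes an sm_53-or-newer GPU so that `__half` arithmetic is available): an FP16 running sum stops growing once its spacing dwarfs the addend, while an FP32 accumulator stays accurate.

```cuda
// Build (example): nvcc -arch=sm_70 fp16_accum.cu
#include <cstdio>
#include <cuda_fp16.h>

// Add `val` to a running sum `n` times, once with an FP16 accumulator
// (rounding to half precision after every add) and once with FP32.
__global__ void accumulate(int n, float val, float* out_fp16, float* out_fp32) {
  __half acc16 = __float2half(0.0f);
  float  acc32 = 0.0f;
  for (int i = 0; i < n; ++i) {
    acc16 = __hadd(acc16, __float2half(val));  // result rounded to FP16 at each step
    acc32 += val;
  }
  *out_fp16 = __half2float(acc16);
  *out_fp32 = acc32;
}

int main() {
  float *d_out, h_out[2];
  cudaMalloc(&d_out, 2 * sizeof(float));
  // 16384 additions of 0.01: the exact sum is 163.84. The FP16 accumulator
  // stops growing near 32, where its spacing (0.03125) dwarfs the addend.
  accumulate<<<1, 1>>>(16384, 0.01f, d_out, d_out + 1);
  cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
  printf("FP16 accumulator: %f\nFP32 accumulator: %f\n", h_out[0], h_out[1]);
  cudaFree(d_out);
  return 0;
}
```

In a GEMM, the partial products along the K dimension play the role of these small addends, so for large K the accumulator's precision matters more than the inputs' precision.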

Thanks for your detailed explanation.