NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Home Page: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html

When A and B are fp8 tensors, the compute type could be `CUBLAS_COMPUTE_16F`

condy0919 opened this issue

cublasComputeType_t gemm_compute_type = CUBLAS_COMPUTE_32F;
if (A_type == CUDA_R_32F && B_type == CUDA_R_32F && D_type == CUDA_R_32F) {
  gemm_compute_type = CUBLAS_COMPUTE_32F_FAST_TF32;
}
// Create matrix descriptors. Not setting any extra attributes.
NVTE_CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&Adesc, A_type,
                                             transa == CUBLAS_OP_N ? m : k,
                                             transa == CUBLAS_OP_N ? k : m,
                                             lda));
NVTE_CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&Bdesc, B_type,
                                             transb == CUBLAS_OP_N ? k : n,
                                             transb == CUBLAS_OP_N ? n : k,
                                             ldb));
NVTE_CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&Ddesc, D_type, m, n, ldd));
NVTE_CHECK_CUBLAS(cublasLtMatmulDescCreate(&operationDesc, gemm_compute_type, CUDA_R_32F));
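
For reference, the change suggested in the title would amount to something like the sketch below. This is only an illustration, not a TransformerEngine patch: it assumes the FP8 inputs arrive as the `CUDA_R_8F_E4M3` / `CUDA_R_8F_E5M2` dtypes, and it sets aside whether cuBLASLt actually accepts `CUBLAS_COMPUTE_16F` for FP8 operands.

```cuda
// Sketch only: request an FP16 accumulator when both inputs are FP8.
const bool a_is_fp8 = (A_type == CUDA_R_8F_E4M3 || A_type == CUDA_R_8F_E5M2);
const bool b_is_fp8 = (B_type == CUDA_R_8F_E4M3 || B_type == CUDA_R_8F_E5M2);

cublasComputeType_t gemm_compute_type = CUBLAS_COMPUTE_32F;
if (A_type == CUDA_R_32F && B_type == CUDA_R_32F && D_type == CUDA_R_32F) {
  gemm_compute_type = CUBLAS_COMPUTE_32F_FAST_TF32;
} else if (a_is_fp8 && b_is_fp8) {
  gemm_compute_type = CUBLAS_COMPUTE_16F;  // proposed: FP16 accumulation for FP8 GEMMs
}
```

The scale type passed to `cublasLtMatmulDescCreate` would presumably have to change from `CUDA_R_32F` to `CUDA_R_16F` as well, since cuBLASLt pairs `CUBLAS_COMPUTE_16F` with an FP16 scale type.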

The precision of FP16 is enough for the additions and multiplications of FP8 values, even though FP8-acc-FP32 has the same FLOPS as FP8-acc-FP16 on H100.

[Image: Tensor Core throughput table from the Hopper whitepaper.]

Specifying the FP16 compute type states the required precision more exactly, and the FLOPS of FP8-acc-FP16 may be boosted on future architectures.

The above image is from https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper

Please keep in mind that with Tensor Cores the accumulator stores the result of not just a single multiplication and addition of two FP8 values, but of a long series of such multiplications and additions. Each element of that sum can have a different magnitude, and the precision of FP16 could be inadequate to produce an accurate output. In internal experiments we saw convergence issues when using the lower-precision accumulator.
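
As a standalone illustration of this point (not TransformerEngine code, just a minimal CUDA sketch; it assumes an sm_53-or-newer GPU so that `__half` arithmetic is available): an FP16 running sum stops growing once its spacing dwarfs the addend, while an FP32 accumulator stays accurate.

```cuda
// Build (example): nvcc -arch=sm_70 fp16_accum.cu
#include <cstdio>
#include <cuda_fp16.h>

// Add `val` to a running sum `n` times, once with an FP16 accumulator
// (rounding to half precision after every add) and once with FP32.
__global__ void accumulate(int n, float val, float* out_fp16, float* out_fp32) {
  __half acc16 = __float2half(0.0f);
  float  acc32 = 0.0f;
  for (int i = 0; i < n; ++i) {
    acc16 = __hadd(acc16, __float2half(val));  // result rounded to FP16 at each step
    acc32 += val;
  }
  *out_fp16 = __half2float(acc16);
  *out_fp32 = acc32;
}

int main() {
  float *d_out, h_out[2];
  cudaMalloc(&d_out, 2 * sizeof(float));
  // 16384 additions of 0.01: the exact sum is 163.84. The FP16 accumulator
  // stops growing near 32, where its spacing (0.03125) dwarfs the addend.
  accumulate<<<1, 1>>>(16384, 0.01f, d_out, d_out + 1);
  cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
  printf("FP16 accumulator: %f\nFP32 accumulator: %f\n", h_out[0], h_out[1]);
  cudaFree(d_out);
  return 0;
}
```

In a GEMM, the partial products along the K dimension play the role of these small addends, so for large K the accumulator's precision matters more than the inputs' precision.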

Thanks for your detailed explanation.