NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Home Page: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html


[ERROR] cuBLAS error when launching training with Megatron-LM and TransformerEngine

Btlmd opened this issue

Hi,

I am using Megatron-LM with TransformerEngine to launch LM training. I encountered the following issue when the data-parallel (DP) world size is not a nicely rounded number, e.g. 30.

RuntimeError: TransformerEngine/transformer_engine/common/gemm/cublaslt_gemm.cu:326 in function cublas_gemm: cuBLAS Error: the requested functionality is not supported
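This cuBLAS rejection traces back to operand device pointers that are not sufficiently aligned, which is the alignment issue discussed below. As a first diagnostic step, one can inspect the alignment of each GEMM operand's pointer; a minimal sketch assuming PyTorch, where pointer_alignment is a hypothetical helper (not a TransformerEngine API) and the 256-byte cap mirrors the thresholds mentioned in this issue:

```python
import torch

def pointer_alignment(t: torch.Tensor) -> int:
    """Return the largest power-of-two byte alignment of t's device
    pointer, capped at 256 bytes (the strictest threshold in this issue)."""
    addr = t.data_ptr()
    # addr & -addr isolates the lowest set bit, i.e. the largest
    # power of two that divides the address.
    return 256 if addr == 0 else min(addr & -addr, 256)

# Hypothetical usage: inspect the operands right before the failing GEMM.
a = torch.empty(4096, 4096, dtype=torch.bfloat16, device="cuda")
print(f"A aligned to {pointer_alignment(a)} bytes")
```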

This error is related to #845. However, after fixing the alignment issue following #845, the above error is resolved, but we then encounter another failure, where:

  • all nodes where some tensor addresses are misaligned to 256 bytes (A is aligned only to 4, 8, or 16) hang with no error reported
  • some of the nodes whose addresses are all aligned to 256 bytes also hang with no error
  • some of the nodes whose addresses are all aligned to 256 bytes hang with the torch distributed error 'Connection reset by peer'

The strange thing is that the error is only reported on some of the nodes where all of the addresses are aligned to 256 bytes. We reproduced this error on nvcr.io/nvidia/nemo:24.03.01.framework and nvcr.io/nvidia/nemo:24.01.01.framework.
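To distinguish ranks that actually hold misaligned operands from ranks that merely hang waiting on them, one option is to gather every rank's alignments onto rank 0 before the failing step. A sketch assuming PyTorch with torch.distributed already initialized; gather_alignment_report and the tensors argument are hypothetical names, not Megatron-LM or TransformerEngine APIs:

```python
import torch.distributed as dist

def gather_alignment_report(tensors: dict) -> None:
    """Gather every rank's operand alignments onto rank 0 and print,
    per rank, the operands whose pointers are misaligned to 256 bytes."""
    local = {
        name: min(t.data_ptr() & -t.data_ptr(), 256) if t.data_ptr() else 256
        for name, t in tensors.items()
    }
    reports = [None] * dist.get_world_size()
    # Collect (rank, alignments) pairs from all ranks.
    dist.all_gather_object(reports, (dist.get_rank(), local))
    if dist.get_rank() == 0:
        for rank, aligns in sorted(reports):
            bad = {k: v for k, v in aligns.items() if v < 256}
            print(f"rank {rank}: misaligned-to-256 operands: {bad or 'none'}")
```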

I'm not sure where to start further debugging. I would be grateful if anyone could offer some help.
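Not an answer to the root cause, but a common starting point for hangs like this is to enable the standard PyTorch/NCCL debug switches before torch.distributed is initialized; these are generic knobs, not TransformerEngine-specific:

```python
import os

# Generic PyTorch/NCCL debugging knobs; set these before torch.distributed
# is initialized (e.g., at the top of the launch script).
os.environ["NCCL_DEBUG"] = "INFO"                 # per-rank NCCL logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # flag mismatched collectives
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"          # report async CUDA errors at the call site
```

With CUDA_LAUNCH_BLOCKING=1, a recurring cuBLAS error should be reported at the offending call rather than at a later synchronization point, and the NCCL logs can help show which communicator the hung ranks are waiting on.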

Do you have any idea concerning this error? @phu0ngng

The error is fixed by NVIDIA/Megatron-LM@c3677e0.