huggingface / nanotron

Minimalistic large language model 3D-parallelism training

[Feature] Kernel Fusion of Layer Norm and GeLU

xrsrke opened this issue · comments

Despite the optimization for GEMM operators in Megatron-LM, we identify opportunities for further enhancement in other operators. For the attention part, we adopt FlashAttention-2 [16], which improves work partitioning between different thread blocks and warps. For LayerNorm and GeLU, we observe that they are composed of fine-grained kernels in previous implementations. By fusing these kernels together, we reduce the overhead associated with launching multiple kernels and aid in optimizing memory access patterns, thereby achieving better performance.

Reference: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 5
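
For illustration, here is a minimal, forward-only sketch of what such a fused LayerNorm + GeLU kernel could look like in Triton. This is not nanotron's implementation; the kernel and wrapper names are hypothetical, the GeLU uses the tanh approximation (written via sigmoid), and it assumes the hidden dimension fits in a single block. The point is that the LayerNorm output stays in registers and feeds directly into GeLU, so there is one kernel launch and no intermediate round trip through global memory.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _fused_layernorm_gelu_fwd(
    x_ptr, w_ptr, b_ptr, out_ptr,
    n_cols, row_stride, eps,
    BLOCK_SIZE: tl.constexpr,
):
    # One program instance handles one row: normalize, scale/shift, then GeLU,
    # all in registers, so the LayerNorm result is never written to global memory.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols

    x = tl.load(x_ptr + row * row_stride + cols, mask=mask, other=0.0).to(tl.float32)

    # LayerNorm statistics over the row.
    mean = tl.sum(x, axis=0) / n_cols
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / n_cols
    x_hat = diff / tl.sqrt(var + eps)

    w = tl.load(w_ptr + cols, mask=mask, other=1.0).to(tl.float32)
    b = tl.load(b_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x_hat * w + b

    # tanh-approximate GeLU, rewritten with sigmoid:
    # tanh(z) = 2*sigmoid(2z) - 1, so 0.5*y*(1 + tanh(z)) == y * sigmoid(2z).
    z = 0.7978845608028654 * (y + 0.044715 * y * y * y)  # sqrt(2/pi) * (y + 0.044715*y^3)
    out = y * tl.sigmoid(2.0 * z)

    tl.store(out_ptr + row * row_stride + cols, out.to(out_ptr.dtype.element_ty), mask=mask)


def fused_layernorm_gelu(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps: float = 1e-5):
    # Hypothetical wrapper: flatten leading dims and launch one program per row.
    # Assumes the last (normalized) dimension fits in a single block.
    orig_shape = x.shape
    x2d = x.reshape(-1, orig_shape[-1]).contiguous()
    out = torch.empty_like(x2d)
    n_rows, n_cols = x2d.shape
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    _fused_layernorm_gelu_fwd[(n_rows,)](
        x2d, weight, bias, out, n_cols, x2d.stride(0), eps, BLOCK_SIZE=BLOCK_SIZE
    )
    return out.reshape(orig_shape)
```

A production version would also need a fused backward pass, autotuned block sizes / warp counts, and handling for rows larger than one block; the sketch above only shows why fusion removes the extra kernel launches and the intermediate memory traffic the MegaScale excerpt describes.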