huggingface / nanotron

Minimalistic large language model 3D-parallelism training

[Feature] Kernel Fusion of Layer Norm and GeLU

xrsrke opened this issue · comments

Despite the optimization for GEMM operators in Megatron-LM, we identify opportunities for further enhancement in other operators. For the attention part, we adopt FlashAttention-2 [16], which improves work partitioning between different thread blocks and warps. For LayerNorm and GeLU, we observe that they are composed of fine-grained kernels in previous implementations. By fusing these kernels together, we reduce the overhead associated with launching multiple kernels and aid in optimizing memory access patterns, thereby achieving better performance.

Reference: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 5
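
For illustration, here is a minimal, forward-only sketch of what such a fused LayerNorm + GeLU kernel could look like in Triton. This is not nanotron's implementation; the kernel and wrapper names are hypothetical, the GeLU uses the tanh approximation (written via sigmoid), and it assumes the hidden dimension fits in a single block. The point is that the LayerNorm output stays in registers and feeds directly into GeLU, so there is one kernel launch and no intermediate round trip through global memory.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _fused_layernorm_gelu_fwd(
    x_ptr, w_ptr, b_ptr, out_ptr,
    n_cols, row_stride, eps,
    BLOCK_SIZE: tl.constexpr,
):
    # One program instance handles one row: normalize, scale/shift, then GeLU,
    # all in registers, so the LayerNorm result is never written to global memory.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols

    x = tl.load(x_ptr + row * row_stride + cols, mask=mask, other=0.0).to(tl.float32)

    # LayerNorm statistics over the row.
    mean = tl.sum(x, axis=0) / n_cols
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / n_cols
    x_hat = diff / tl.sqrt(var + eps)

    w = tl.load(w_ptr + cols, mask=mask, other=1.0).to(tl.float32)
    b = tl.load(b_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x_hat * w + b

    # tanh-approximate GeLU, rewritten with sigmoid:
    # tanh(z) = 2*sigmoid(2z) - 1, so 0.5*y*(1 + tanh(z)) == y * sigmoid(2z).
    z = 0.7978845608028654 * (y + 0.044715 * y * y * y)  # sqrt(2/pi) * (y + 0.044715*y^3)
    out = y * tl.sigmoid(2.0 * z)

    tl.store(out_ptr + row * row_stride + cols, out.to(out_ptr.dtype.element_ty), mask=mask)


def fused_layernorm_gelu(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps: float = 1e-5):
    # Hypothetical wrapper: flatten leading dims and launch one program per row.
    # Assumes the last (normalized) dimension fits in a single block.
    orig_shape = x.shape
    x2d = x.reshape(-1, orig_shape[-1]).contiguous()
    out = torch.empty_like(x2d)
    n_rows, n_cols = x2d.shape
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    _fused_layernorm_gelu_fwd[(n_rows,)](
        x2d, weight, bias, out, n_cols, x2d.stride(0), eps, BLOCK_SIZE=BLOCK_SIZE
    )
    return out.reshape(orig_shape)
```

A production version would also need a fused backward pass, autotuned block sizes / warp counts, and handling for rows larger than one block; the sketch above only shows why fusion removes the extra kernel launches and the intermediate memory traffic the MegaScale excerpt describes.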