TiledTensor / TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

Add flash attention based on b2b GEMM

KuangjuX opened this issue

I would suggest preparing an implementation of the flash attention algorithm (I prefer to call it a parallel algorithm).

I think flash attention has many implications for how to schedule an efficient DNN computation, since it combines several elements: reusing the TCU's output at the register level, warp-level reduction, element-wise operations, and the arrangement of warps, among others.

Preparing the implementation first will let us see how to organize the structure of the computation.
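For reference, here is a minimal CPU-side sketch (not TiledCUDA code; the function name `flash_attention_ref` and the tile size `kTileK` are made up for illustration) of the structure being discussed: the two back-to-back GEMMs, `S = Q*K^T` and `O = softmax(S)*V`, fused through an online softmax so the first GEMM's output tile is consumed immediately, which is the part that would stay in registers on the TCU path.

```cpp
// Hypothetical standalone reference, single attention head.
// Q is [seq_q x d], K and V are [seq_k x d], O is [seq_q x d], row-major.
// kTileK is the key/value tile size that would map to one mainloop step
// of the fused kernel.
#include <vector>
#include <cmath>
#include <algorithm>

void flash_attention_ref(const std::vector<float>& Q,
                         const std::vector<float>& K,
                         const std::vector<float>& V,
                         std::vector<float>& O,
                         int seq_q, int seq_k, int d, int kTileK = 64) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  for (int i = 0; i < seq_q; ++i) {
    float m = -INFINITY;              // running row maximum
    float l = 0.0f;                   // running softmax denominator
    std::vector<float> acc(d, 0.0f);  // running (unnormalized) output row

    for (int kb = 0; kb < seq_k; kb += kTileK) {
      const int kend = std::min(kb + kTileK, seq_k);

      // First GEMM tile: s_j = scale * (q_i . k_j); this is the output
      // that a fused kernel keeps on-chip instead of writing to memory.
      std::vector<float> s(kend - kb);
      float tile_max = -INFINITY;
      for (int j = kb; j < kend; ++j) {
        float dot = 0.0f;
        for (int t = 0; t < d; ++t) dot += Q[i * d + t] * K[j * d + t];
        s[j - kb] = dot * scale;
        tile_max = std::max(tile_max, s[j - kb]);
      }

      // Online softmax: rescale the previous accumulator to the new max.
      const float m_new = std::max(m, tile_max);
      const float correction = std::exp(m - m_new);
      l *= correction;
      for (int t = 0; t < d; ++t) acc[t] *= correction;

      // Second GEMM tile: accumulate exp(s - m_new) * V without ever
      // materializing the full score matrix.
      for (int j = kb; j < kend; ++j) {
        const float p = std::exp(s[j - kb] - m_new);
        l += p;
        for (int t = 0; t < d; ++t) acc[t] += p * V[j * d + t];
      }
      m = m_new;
    }
    for (int t = 0; t < d; ++t) O[i * d + t] = acc[t] / l;
  }
}
```

In a kernel, the outer `i` loop would become the block/warp tiling over query rows, the per-tile score buffer `s` would live in registers as the TCU output fragment, and the max/denominator updates would become warp reductions, which is exactly the set of elements listed above.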