rayleizhu / awsome-cuda-attention

Collection of efficient implementations of attention with CUDA


awsome-cuda-attention

A collection of efficient implementations of attention with CUDA.

Flash Attention

These two papers describe how to implement exact attention while avoiding the O(n^2) memory occupied by the intermediate attention matrix (i.e., QK^T).
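The core idea can be illustrated outside CUDA as well. The following NumPy sketch (an illustration of the tiled "online softmax" recurrence, not any of the referenced implementations) processes K and V in blocks while carrying a running row-wise max and softmax denominator, so the full n x n score matrix never materializes:

```python
import numpy as np

def flash_attention(Q, K, V, block_size=2):
    """Tiled exact attention: K/V are consumed block by block while running
    softmax statistics (row max m, denominator l) are updated, so only an
    n x block_size slice of scores exists at any time."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row-wise max of scores seen so far
    l = np.zeros(n)           # running row-wise softmax denominator
    for j in range(0, K.shape[0], block_size):
        Kj = K[j:j + block_size]
        Vj = V[j:j + block_size]
        S = (Q @ Kj.T) * scale                  # n x block_size tile of scores
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])          # numerically stable exponentials
        correction = np.exp(m - m_new)          # rescale previously accumulated terms
        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

The result matches ordinary softmax attention exactly; only the order of accumulation differs.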

Implementations can be found at:

Block Sparse Attention

Sparse GEMMs are not GPU-friendly due to poor spatial and temporal locality, but structured block sparsity solves this problem. See OpenAI's blog for details.
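To make the idea concrete, here is a NumPy sketch (my own illustration, not OpenAI's kernel) of attention restricted by a block mask: pruned tiles of the score matrix are skipped entirely, while kept tiles are computed with dense, hardware-friendly GEMMs:

```python
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block_size):
    """Attention with a block-sparse score pattern. block_mask[i, j] == True
    enables the block_size x block_size tile (i, j) of the score matrix;
    False tiles are never computed or stored."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    for bi in range(block_mask.shape[0]):
        qi = slice(bi * block_size, (bi + 1) * block_size)
        tiles, vals = [], []
        for bj in range(block_mask.shape[1]):
            if not block_mask[bi, bj]:
                continue                          # pruned tile: zero cost
            kj = slice(bj * block_size, (bj + 1) * block_size)
            tiles.append(Q[qi] @ K[kj].T * scale)  # dense GEMM on a kept tile
            vals.append(V[kj])
        S = np.concatenate(tiles, axis=1)          # scores over kept tiles only
        P = np.exp(S - S.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)          # softmax over kept keys
        O[qi] = P @ np.concatenate(vals, axis=0)
    return O
```

With an all-True mask this reduces to dense attention; sparser masks trade coverage for proportionally less compute and memory traffic.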

References
